
Bayesian Inference

Bayesian Inference: Theory, Methods, Computations provides comprehensive coverage of the fundamentals of Bayesian inference from all important perspectives, namely theory, methods, and computations.
All theoretical results are presented as formal theorems, corollaries, lemmas etc., furnished
with detailed proofs. The theoretical ideas are explained in simple and easily comprehensible
forms, supplemented with several examples. A clear reasoning on the validity, usefulness,
and pragmatic approach of the Bayesian methods is provided. A large number of examples
and exercises, and solutions to all exercises, are provided to help students understand the
concepts through ample practice.
The book is primarily aimed at first- or second-semester master's students; parts of the book can also be used at Ph.D. level or by the research community at large. The emphasis is on exact cases. However, to gain further insight into the core concepts, an entire chapter is dedicated to computer-intensive techniques. Selected chapters and sections of the book can be used for a one-semester course on Bayesian statistics.
Key Features:
• Explains basic ideas of Bayesian statistical inference in an easily comprehensible form
• Illustrates main ideas through sketches and plots
• Contains a large number of examples and exercises
• Provides solutions to all exercises
• Includes R codes
Silvelyn Zwanzig is Professor of Mathematical Statistics at Uppsala University. She studied Mathematics at the Humboldt University of Berlin. Before coming to Sweden, she was
Assistant Professor at the University of Hamburg in Germany. She received her Ph.D. in
Mathematics at the Academy of Sciences of the GDR. She has taught Statistics to undergrad-
uate and graduate students since 1991. Her research interests include theoretical statistics
and computer-intensive methods.
Rauf Ahmad is Associate Professor at the Department of Statistics, Uppsala University.
He did his Ph.D. at the University of Göttingen, Germany. Before joining Uppsala Univer-
sity, he worked at the Division of Mathematical Statistics, Department of Mathematics,
Linköping University, and at Biometry Division, Swedish University of Agricultural Sci-
ences, Uppsala. He has taught Statistics to undergraduate and graduate students since
1995. His research interests include high-dimensional inference, mathematical statistics,
and U-statistics.
Bayesian Inference
Theory, Methods, Computations

Silvelyn Zwanzig and Rauf Ahmad


Designed cover image: © Silvelyn Zwanzig and Rauf Ahmad

First edition published 2024


by CRC Press
2385 NW Executive Center Drive, Suite 320, Boca Raton FL 33431

and by CRC Press


4 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN

CRC Press is an imprint of Taylor & Francis Group, LLC

© 2024 Silvelyn Zwanzig and Rauf Ahmad

Reasonable efforts have been made to publish reliable data and information, but the author and pub-
lisher cannot assume responsibility for the validity of all materials or the consequences of their use.
The authors and publishers have attempted to trace the copyright holders of all material reproduced
in this publication and apologize to copyright holders if permission to publish in this form has not
been obtained. If any copyright material has not been acknowledged please write and let us know so
we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact mpkbookspermissions@tandf.co.uk

Trademark notice: Product or corporate names may be trademarks or registered trademarks and are
used only for identification and explanation without intent to infringe.

ISBN: 978-1-032-10949-7 (hbk)
ISBN: 978-1-032-11809-3 (pbk)
ISBN: 978-1-003-22162-3 (ebk)

DOI: 10.1201/9781003221623

Typeset in CMR10
by KnowledgeWorks Global Ltd.

Publisher’s note: This book has been prepared from camera-ready copy provided by the authors.
Contents

Preface ix

1 Introduction 1

2 Bayesian Modelling 5
2.1 Statistical Model 5
2.2 Bayes Model 10
2.3 Advantages 22
2.3.1 Sequential Analysis 22
2.3.2 Big Data 23
2.3.3 Hierarchical Models 25
2.4 List of Problems 27

3 Choice of Prior 29
3.1 Subjective Priors 31
3.2 Conjugate Priors 38
3.3 Non-informative Priors 50
3.3.1 Laplace Prior 50
3.3.2 Jeffreys Prior 53
3.3.3 Reference Priors 63
3.4 List of Problems 74

4 Decision Theory 78
4.1 Basics of Decision Theory 78
4.2 Bayesian Decision Theory 83
4.3 Common Bayes Decision Rules 85
4.3.1 Quadratic Loss 85
4.3.2 Absolute Error Loss 89
4.3.3 Prediction 91
4.3.4 The 0–1 Loss 93
4.3.5 Intrinsic Losses 95
4.4 The Minimax Criterion 97
4.5 Bridges 101
4.6 List of Problems 110

5 Asymptotic Theory 113
5.1 Consistency 113
5.2 Schwartz’ Theorem 120
5.3 List of Problems 125

6 Normal Linear Models 126


6.1 Univariate Linear Models 127
6.2 Bayes Linear Models 131
6.2.1 Conjugate Prior: Parameter θ = β, σ 2 Known 131
6.2.2 Conjugate Prior: Parameter θ = (β, σ 2 ) 145
6.2.3 Jeffreys Prior 153
6.3 Linear Mixed Models 159
6.3.1 Bayes Linear Mixed Model, Marginal Model 164
6.3.2 Bayes Hierarchical Linear Mixed Model 165
6.4 Multivariate Linear Models 170
6.5 Bayes Multivariate Linear Models 173
6.5.1 Conjugate Prior 173
6.5.2 Jeffreys Prior 177
6.6 List of Problems 180

7 Estimation 183
7.1 Maximum a Posteriori (MAP) Estimator 186
7.1.1 Regularized Estimators 188
7.2 Bayes Rules 193
7.2.1 Estimation in Univariate Linear Models 195
7.2.2 Estimation in Multivariate Linear Models 198
7.3 Credible Sets 199
7.3.1 Credible Sets in Linear Models 203
7.4 Prediction 206
7.4.1 Prediction in Linear Models 210
7.5 List of Problems 212

8 Testing and Model Comparison 216


8.1 Bayes Rule 217
8.2 Bayes Factor 220
8.2.1 Point Null Hypothesis 225
8.2.2 Bayes Factor in Linear Model 227
8.2.3 Improper Prior 230
8.3 Bayes Information 232
8.3.1 Bayesian Information Criterion (BIC) 232
8.3.2 Deviance Information Criterion (DIC) 233
8.4 List of Problems 236
9 Computational Techniques 240
9.1 Deterministic Methods 241
9.1.1 Brute-Force 241
9.1.2 Laplace Approximation 242
9.2 Independent Monte Carlo Methods 244
9.2.1 Importance Sampling (IS) 248
9.3 Sampling from the Posterior 252
9.3.1 Sampling Importance Resampling (SIR) 253
9.3.2 Rejection Algorithm 254
9.4 Markov Chain Monte Carlo (MCMC) 257
9.4.1 Metropolis–Hastings Algorithms 259
9.4.2 Gibbs Sampling 263
9.5 Approximative Bayesian Computation (ABC) 268
9.6 Variational Inference (VI) 274
9.7 List of Problems 280

10 Solutions 284
10.1 Solutions for Chapter 2 284
10.2 Solutions for Chapter 3 287
10.3 Solutions for Chapter 4 292
10.4 Solutions for Chapter 5 296
10.5 Solutions for Chapter 6 300
10.6 Solutions for Chapter 7 306
10.7 Solutions for Chapter 8 312
10.8 Solutions for Chapter 9 319

11 Appendix 323
11.1 Discrete Distributions 323
11.2 Continuous Distributions 324
11.3 Multivariate Distributions 325
11.4 Matrix-Variate Distributions 326

Bibliography 329

Index 333
Preface

Our main objective in writing this book is to present Bayesian statistics at an introductory level, in a pragmatic form, specifically focusing on the exact cases. All three aspects of Bayesian inference, namely theory, methods, and computations, are covered.

The book is written primarily as a textbook, but it also contains ample material for researchers. As a textbook, it contains more material than needed for a one-semester course; the choice of chapters depends on the background and interest of the students and on the specific aims of the designed course. A number of examples and exercises are embedded in all chapters, and solutions to all exercises are provided, often with reasonable detail of the solution steps.

Another salient feature of the book is cartoon-based depictions, aimed at a


leisurely comprehension of otherwise intricate mathematical concepts. For this
contribution, the authors are indebted to the first author’s daughter, Annina
Heinrich, who so meticulously and elegantly transformed arcane ideas into
amusing illustrations.

The book also adds a new piece of joint output in the authors’ repository of
long-term team work, mostly leading to research articles or book chapters.

Most of the material in the book stems from Master and Ph.D. courses on Bayesian Statistics taught over the last several years. The courses were offered through the Department of Mathematics and the Center for Interdisciplinary Mathematics (CIM), Uppsala University. Special thanks are due to the colleagues and students who read preliminary drafts of the book and provided valuable feedback. Any remaining mistakes or typos are of course the authors' responsibility, and any information in this regard will be highly appreciated.

We wish the readers a relishing amble through the pleasures and challenges
of Bayesian statistics!

Silvelyn Zwanzig and Rauf Ahmad


December 2023, Uppsala, Sweden

Chapter 1

Introduction

Bayesian theory has seen increasing popularity over the last few decades. This can be ascribed to a combination of two factors, namely the fast growth of computational power and the applicability of the theory to a wide variety of real-life problems.

Reverend Thomas Bayes (1702–1761), the English nonconformist Presbyterian


minister, laid the foundation stone of the theory in terms of a theorem which
duly carries his name in probability theory. For any two events, A and B,
defined in a sample space Ω, with marginal probabilities P(A), P(B) and
conditional probability P(B|A), the theorem gives the conditional probability

$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}.$$

The theorem, thus, updates the unconditional (prior) probability of A, P(A), into the conditional (posterior) probability of A, P(A|B), using knowledge of B.
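As a quick numerical illustration (added here as a sketch; the events and numbers are hypothetical, in the spirit of a diagnostic test), the updating can be computed in a few lines of R:

# Hypothetical illustration of Bayes' theorem for two events A and B:
# A = "condition present", B = "test positive".
p_A      <- 0.01              # prior P(A)
p_B_A    <- 0.95              # P(B|A)
p_B_notA <- 0.10              # P(B|not A)
p_B      <- p_B_A * p_A + p_B_notA * (1 - p_A)   # total probability P(B)
p_A_B    <- p_B_A * p_A / p_B                    # posterior P(A|B)
p_A_B                                            # approximately 0.088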

The theorem, however, remained unknown to the wider world until one of Bayes' closest friends, Richard Price (1723–1791), posthumously dug it up, meticulously edited it, and in 1764 sent it to the Royal Society, of which both were members. Price also attached a detailed note to the article, arguing for the value and worth of the theory Bayes had put forth in it. For a detailed historical profile of the subject, and of Thomas Bayes, see e.g., Bellhouse (2004) and Stigler (1982).

In this context, it is indeed interesting to note that the axiomatic approach


to probability theory, as developed by Kolmogorov, and used as a standard
probability framework today, emerged about two centuries after Bayes intro-
duced his theorem.

What was introduced by Bayes and popularized by Price was later formalized by Laplace, modernized by Jeffreys, and mathematicised by de Finetti. Harold Jeffreys rightly remarked that Bayes' theorem is to the theory of probability what the Pythagorean theorem is to geometry. Over time, Bayesian theory has occupied a prominent place in the community of researchers.

We may remind ourselves that Statistics is more a research tool than a discipline like others; as such, Statistics helps develop and apply techniques to measure and assess uncertainty in inductive inference. This implies the applicability of Statistics in any research context involving a random mechanism where the researcher intends to draw conclusions inductively.

The frequentist way of inference, based on likelihood theory, is one way to achieve


this end. In likelihood theory, the uncertainty, captured through the data x,
is formulated in terms of densities or mass functions, f (x|θ), θ ∈ Θ, where Θ
is the parameter space, so that the underlying model can be stated as

P = {f (x|θ); θ ∈ Θ}.

The target of inference is the unknown parameter, θ, which generates the


data and is considered to be fixed. In Bayesian theory, on the other hand, the uncertainty comes as a whole package, both in the data, via the likelihood, and in the unknown parameter. The former is formulated in terms of densities or mass functions, f(x|θ), and the latter in terms of prior distributions on the parameters, denoted π(θ), so that a Bayes model includes both the likelihood and the prior, i.e.,
{P, π}.
Thus, the modus operandi of the frequentist inference consists of the science of
collecting empirical evidence and the art of gleaning information from this ev-
idence in order to support or refute the conjectures made about the unknown
parameters. The modus operandi of the Bayesian inference consists of the art
of setting a prior and the science of updating the prior into the posterior, by
exploiting the information in data.

The addition of prior as a measure of uncertainty shifts the focus in Bayesian


theory, by considering the prior as the main source of information about the
parameter, where data, or likelihood, helps improve the prior into a posterior.
Applying Bayes' theorem gives Posterior ∝ Likelihood × Prior; formally,

π(θ|x) ∝ f(x|θ)π(θ).

The prior can be based on subjective beliefs. This mainly pertains to the fact, one that cannot be overemphasized, that scientific inquiry often incorporates subjectively formulated conjectures as essential components in exploring nature. This can be evidenced from the use of subjectivism by world-renowned researchers who excelled in a variety of scientific and philosophical domains. Press and Tanur (2001) list many of them, in the context of their subjectivism, including giants such as Johannes Kepler, Gregor Mendel, Galileo Galilei,
Isaac Newton, Alexander von Humboldt, Charles Darwin, Sigmund Freud,
and Albert Einstein.

The posterior follows therefore from an amalgamation of subjectively decided


prior and data-based likelihood, a combination of belief and evidence. By
this, the posterior logically tilts toward the component that carries the heav-
ier weight. One might in fact find oneself tempted to imagine that the prior
as belief in Bayesian inference possibly reflects the religious aptitude of the
founder of the idea.

Certain specific advantages of Bayesian theory make it particularly attractive for practical applications. In particular, although the prior can incorporate a subjective component into inference, it also provides a flexible mode of inference by allowing the most suitable model to be picked while a variety of feasible options is tried out. Further, a prior can reflect not only the researcher's knowledge but also ignorance concerning the parameter.

The recent renaissance of Bayesian theory can indeed be considered from


another, historical and philosophical, perspective. In the eighties of the last
century, all possible and interesting models were considered, with explicit for-
mulas derived, new distribution families introduced and studied. At that time,
the splitting of the scientific community into Bayesian and fiducial statisti-
cians seemed to be complete and definite.

During the last decades, however, the more pragmatic Bayesian approach has
become popular, especially because of the iterative structure of the formulas
which allow adaptive applications. An objective Bayesian analysis has been developed, where the prior is determined by information-theoretic criteria, avoiding a subjective influence on the result of the study on the one hand and exploiting the advantages of Bayes analysis on the other.

A major reason for this renaissance lies in the efficient computational support. The availability of computer-intensive methods such as MCMC (Markov Chain Monte Carlo) and ABC (Approximative Bayesian Computation) in the modern era has led to a second coming of Bayesian theory. This has obviously widened the spectrum of applications of Bayesian theory into the modern areas of statistical inference involving big data and its associated problems. The lighter reliance on mathematics, and heavier reliance on computational algorithms, has rightfully enhanced interest in the Bayesian way of inference for researchers and situations where mathematical intricacies can be avoided.

This book is primarily designed as a textbook, to present the fundamentals


of Bayesian statistical inference, based on the Bayes model {P, π}. It covers
basic theory, supplemented with methodological and computational aspects.
It may be emphasized that it is not our objective to weigh in for or against
Bayesian or frequentist inference. We do not even compare the two in any
precise context.

Our motivation stems from the fact that Bayesian inference provides a prag-
matic statistical approach to discern information from real life data. This
aspect indeed sets the book apart from many, if not all, others already in the
literature on the same subject, in that we focus on the cases, both in univari-
ate and multivariate inference, where exact posterior can be derived. Further,
the solutions of all exercises are furnished, to make the book self-contained.

The book can be used for a one-semester Bayesian inference course at the
undergraduate or master level, depending on the students’ background, for
which we expect a reasonable orientation to basic statistical inference, at the
level of e.g., Liero and Zwanzig (2011).

After this brief introduction, Chapter 2 provides general Bayesian modelling


framework, where the main principles are presented, and the difference be-
tween a usual statistical model P and a Bayesian model {P, π} is explained.
Chapter 3 specifically addresses the issue of setting a prior, its variants, and
their effects on the posterior. Methods for incorporating subjective informa-
tion and the ideas and principles of objective priors are discussed.

Chapter 4 focuses on statistical decision theory. It explains the connection


between Bayesian and frequentist forms of inference. It gives optimality prop-
erties of Bayes method for a statistical model P, and explains, vice versa, how
the frequentist methods can be expressed as Bayesian and how their optimal-
ity properties can be derived. Chapter 5 summarizes the main ingredients of
Bayesian asymptotic theory. Chapters 4 and 5 are rather technical and can
be skipped, if the reader so wishes, without losing the smooth transition to the
rest of the book.

Chapter 6 deals with Bayesian theory of linear models under normal distribu-
tion. It includes univariate linear models, linear mixed models and multivariate
models. The explicit formulas for the posteriors are derived. These results are
also important for evaluating simulation methods.

Chapter 7 is on general estimation theory of Bayesian inference. Chapter 8


treats the Bayesian approach for hypothesis testing. Chapter 9 adds essential
computational aspect of Bayesian theory, where the main ideas and princi-
ples of MC, MCMC, Importance sampling, Gibbs sampling and ABC are
presented, including R codes for the examples. Solutions to all exercises in
the book are given in Chapter 10. Apart from the exercises, all chapters are
equipped with detailed examples to explain the methods.
Chapter 2

Bayesian Modelling

The standard setting in statistics is that we consider the data x as a realization


(observation) of a random variable X. The data x may consist of numbers or
vectors, tables, categorical quantities or functions. The set of all possible ob-
servations X is named sample space and the distribution P X of the random
variable X is defined for measurable subsets of the sample space:

The data x ∈ X is a realization of X ∼ P X .

The distribution P X is unknown. It is called the true distribution or the


underlying distribution or the data generating distribution. The goal of math-
ematical statistics is to extract information on the underlying distribution P X
from the data x. This problem can only be solved when we can reduce the set
of all possible distributions over X . We have to assume that we have knowl-
edge about the underlying distribution. Formulating this type of assumptions
in a mathematical way is the purpose of modelling.
In this chapter we present the main principles of Bayesian modelling. First,
we explain the difference between a statistical and a Bayes model. Then we
introduce principles of determining a Bayes model.

2.1 Statistical Model


We start with the notion of a statistical model; see for example in Liero and
Zwanzig (2011).

Definition 2.1 A parametric statistical model is a set of probability


measures over X
P = {Pθ : θ ∈ Θ}.
The set Θ has finite dimension and it is called the parameter space.

Furthermore, we pose a strong condition that the postulated model is the


right one, i.e.,


PX ∈ P. There exists θ0 such that PX = Pθ0 .

The parameter θ0 is the true parameter or the underlying parameter.

Example 2.1 (Flowers)


Snake’s head (Kungsängsliljan, Fritillaria meleagris) is one of the preserved
sights in Uppsala. In May, whole fields along the river are covered by the
flowers. They have three different colors: violet, white and pink. Suppose the
color of n randomly chosen flowers are determined as n1 white, n2 violet and
n3 pink. The (n1, n2, n3) triplets are realizations of the r.v. (N1, N2, N3), with $\sum_{j=1}^{3} N_j = n$, which has a multinomial distribution Mult(n, p1, p2, p3):

$$P(N_1 = n_1, N_2 = n_2, N_3 = n_3) = \frac{n!}{n_1!\,n_2!\,n_3!}\, p_1^{n_1} p_2^{n_2} p_3^{n_3}, \qquad (2.1)$$
with n3 = n − n1 − n2 , p3 = 1 − p1 − p2 and 0 < pj < 1 for j = 1, 2, 3. The
unknown parameter θ = (p1 , p2 , p3 ) consists of the probabilities of each color.
With only three colors, we have p1 + p2 + p3 = 1. The parameter space is

Θ = {θ = (p1 , p2 , p3 ) : 0 < pj < 1, j = 1, 2, 3, p1 + p2 + p3 = 1}.

Thus
P = {Mult(n, p1 , p2 , p3 ) : θ = (p1 , p2 , p3 ) ∈ Θ}.
2
Example 2.2 (Measurements)
Suppose we carry out a physical experiment. Under identical conditions we
take repeated measurements. The data consist of n real numbers xi , i =
1, . . . , n and X = Rn . The sources of randomness are the measurement errors,
which can be assumed as normally distributed. Thus we have a sample of i.i.d.
r.v.’s and the model is given by

P = {N(μ, σ 2 )⊗n : (μ, σ 2 ) ∈ R × R+ }.

Let us consider a different model for the experiment above.

Example 2.3 (Measurements with no distribution specification)


We consider the same experiment as above. But the sources of randomness are
unclear. We are only interested in an unknown constant, namely the expected
value. Further we assume that the variance is finite and the density f of PX
is unknown. In this case we can formulate the model for the distribution of
sample of i.i.d. r.v.’s (X1 , . . . , Xn ) by

$$P = \{P_\theta^{\otimes n} : EX = \mu,\ \mathrm{Var}X = \sigma^2,\ (\mu, \sigma^2) \in R \times R_+,\ f \in F\}.$$
The unknown parameter θ consists of μ, σ², f and the parameter space Θ is
determined as R × R+ × F. In this case, F is some functional space of infinite
dimension. This model does not fulfill the conditions of a statistical model
in Definition 2.1. On the other hand, when we suppose that F is a known parametric distribution family with density f(x) = fθ(x), θ = (μ, σ², ϑ), ϑ ∈ Rq, then the parameter space has dimension less than or equal to q + 2 and the model fulfills the
conditions of Definition 2.1. 2

Let us consider one more example which is probably not realistic but still
a good classroom example for understanding purposes; see also Liero and
Zwanzig (2011).

Example 2.4 (Lion’s appetite)


Suppose that the appetite of a lion has three different stages:

θ ∈ Θ = {hungry, moderate, lethargic} = {θ1 , θ2 , θ3 }.

At night the lion eats x people with a probability Pθ (x) given by the following
table:
x      0      1      2      3      4
θ1     0      0.05   0.05   0.8    0.1
θ2     0.05   0.05   0.8    0.1    0
θ3     0.9    0.05   0.05   0      0
Thus the model consists of three different distributions over X = {0, 1, 2, 3, 4}.
2

The Corona crisis, started in early 2020, led to an increased interest in statis-
tics. We consider an example related to Corona mortality statistics.

Example 2.5 (Corona) The data consist of the number of deaths related to
an infection with COVID-19 in the ith county in Sweden i = 1, . . . , 17, during
the first wave in week 17 (end of April) in 2020. We assume a logistic model
for the probability of death p by COVID-19, depending on the covariates x1
(age), x2 (gender), x3 (health condition) and x4 (accommodation). Thus,
$$p = \frac{\exp(\alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3 + \alpha_4 x_4)}{1 + \exp(\alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3 + \alpha_4 x_4)}.$$
2

Given a statistical model P = {Pθ : θ ∈ Θ}, the main tool for exploring the
information in the data is the likelihood function.
For convenience, we introduce the probability function to handle both dis-
crete and continuous distributions because measure theory is not a prerequisite
of our book. Let A ⊆ X . For a continuous r.v. X with density f (·|θ)

$$P_\theta(A) = \int_A f(x|\theta)\,dx,$$
and for a discrete r.v.
$$P_\theta(A) = \sum_{x \in A} P_\theta(\{x\}),$$

the probability function is defined by



$$p(x|\theta) = \begin{cases} f(x|\theta) & \text{if } P_\theta \text{ is continuous}, \\ P_\theta(\{x\}) & \text{if } P_\theta \text{ is discrete}. \end{cases}$$

Definition 2.2 (Likelihood function) For an observation x of a r.v. X


with p(·|θ), the likelihood function ℓ(·|x) : Θ → R+ is defined by

$$\ell(\theta|x) = p(x|\theta).$$

If X = (X1, . . . , Xn) is a sample of independent r.v.'s, then
$$\ell(\theta|x) = \prod_{i=1}^{n} P_{i,\theta}(x_i) \quad \text{in the discrete case} \qquad (2.2)$$
and
$$\ell(\theta|x) = \prod_{i=1}^{n} f_i(x_i|\theta) \quad \text{in the continuous case}, \qquad (2.3)$$

where Xi is distributed according to Pi,θ and fi(·|θ), respectively. The likelihood principle says that all information is contained in the likelihood function, and a statistical procedure should be based on the likelihood function. Let us quote two variants of the likelihood principle. In Robert (2001, p. 16), the likelihood principle is formulated as follows:

"The information brought by an observation x is entirely contained in the likelihood function ℓ(θ|x). Moreover, if x1 and x2 are two observations depending on the same parameter θ, such that there exists a constant c satisfying

ℓ1(θ|x1) = c ℓ2(θ|x2), for every θ,

then they bring the same information about θ and must lead to identical inferences."

Example 2.6 (Bernoulli trials)


Consider two different sampling strategies for the sequences of Bernoulli trials

(X1 , . . . , Xn , . . .), i.i.d. from X ∼ Ber(p).


First the number of trials is fixed at n and the number of successes x1 is counted. Then
$$X_1 = \sum_{i=1}^{n} X_i \sim \mathrm{Bin}(n, p),$$
with likelihood function at θ = p
$$\ell_1(\theta|x_1) = \binom{n}{x_1} \theta^{x_1}(1-\theta)^{n-x_1} \propto \theta^{x_1}(1-\theta)^{n-x_1}.$$

The other strategy is to fix the number k of successes and observe the sequence
until k successes are attained. The number of failures x2 is counted, where x2
is a realization of X2 which follows a negative binomial distribution:

X2 ∼ NB(k, p),

with likelihood function at θ = p
$$\ell_2(\theta|x_2) = \binom{x_2 + k - 1}{k - 1} \theta^{k}(1-\theta)^{x_2} \propto \theta^{k}(1-\theta)^{x_2}.$$

Suppose 20 Bernoulli trials give 5 successes. Then x1 = 5 and x2 = 15, and we get
$$\ell_1(\theta|x_1) \propto \theta^{5}(1-\theta)^{15} \propto \ell_2(\theta|x_2).$$
The likelihood principle is fulfilled, as both sampling strategies yield the same
result. 2
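This proportionality is easy to check numerically. The following R sketch (added here for illustration; it uses the built-in dbinom and dnbinom densities with n = 20 and k = 5 from the example) evaluates both likelihoods on a grid of θ and shows that their ratio is constant:

# Likelihood principle check for Example 2.6: binomial vs. negative binomial.
theta <- seq(0.01, 0.99, by = 0.01)
n <- 20; x1 <- 5          # binomial: n trials, x1 successes
k <- 5;  x2 <- 15         # negative binomial: k successes, x2 failures
l1 <- dbinom(x1, size = n, prob = theta)
l2 <- dnbinom(x2, size = k, prob = theta)
range(l1 / l2)            # the ratio is constant in theta: same information about theta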

The maximum likelihood principle is alternatively formulated in Lindgren


(1962, p. 225) as follows:

“A statistical inference or procedure should be consistent with


the assumption that the best explanation of a set of data x is
provided by θ̂MLE, a value of θ that maximizes ℓ(θ|x)."

Example 2.7 (Lion’s appetite)


For x = 3, the likelihood function in lion’s example is:

             θ1     θ2     θ3
ℓ(θ|x = 3)   0.8    0.1    0

This leads to the conclusion that a lion, having eaten 3 persons, was hungry.2

Let us consider the likelihood function of a normal sample.


Example 2.8 (Normal sample)
Let X1, . . . , Xn be i.i.d. r.v.'s according to N(μ, σ²). Then for θ = (μ, σ²)

$$\ell(\theta|x) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2\sigma^2}(x_i - \mu)^2\right) \propto (\sigma^2)^{-\frac{n}{2}} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2\right).$$

Since $\sum_{i=1}^{n}(x_i - \bar{x})(\bar{x} - \mu) = 0$, we obtain

$$\sum_{i=1}^{n}(x_i - \mu)^2 = \sum_{i=1}^{n}(x_i - \bar{x})^2 + \sum_{i=1}^{n}(\bar{x} - \mu)^2.$$

Using the sample variance

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2,$$

it follows that the likelihood function depends only on the sufficient statistic (x̄, s²) since

$$\ell(\theta|x) \propto (\sigma^2)^{-\frac{n}{2}} \exp\left(-\frac{1}{2\sigma^2}\left((n-1)s^2 + n(\bar{x} - \mu)^2\right)\right).$$

In particular, it holds that

$$\ell(\theta|x) \propto \ell(\mu|\sigma^2, x)\,\ell(\sigma^2|x),$$

with

$$\ell(\mu|\sigma^2, x) \propto \exp\left(-\frac{n}{2\sigma^2}(\bar{x} - \mu)^2\right)$$

and

$$\ell(\sigma^2|x) \propto (\sigma^2)^{-\frac{n}{2}} \exp\left(-\frac{(n-1)s^2}{2\sigma^2}\right).$$

The likelihood function of μ given σ² has the form of the Gaussian bell curve with center x̄ and inflection points at x̄ − σ/√n and x̄ + σ/√n. The likelihood functions of μ and σ² are plotted in Figure 2.1. 2

2.2 Bayes Model


The Bayes model sharpens the statistical model with an essential additional
assumption. It consists of two parts:
• The underlying unknown parameter θ ∈ Θ is supposed to be a realization
of a random variable with the distribution π over Θ.

Figure 2.1: Likelihood functions in Example 2.12. Left: Likelihood function ℓ(μ|σ², x) with maximum at the sample mean x̄ and inflection points S1 = x̄ − σ/√n and S2 = x̄ + σ/√n. Right: Likelihood function ℓ(σ²|x) with maximum at S = ((n−1)/n)·s², where s² is the sample variance.

• The distribution π is known,


θ ∼ π. (2.4)
In the literature we can find discussions on the meaning of (2.4); some papers give philosophical interpretations, up to religious foundations. Here we follow a pragmatic point of view: we do not speculate about who has carried out the parameter generating experiment. We just state that assumption (2.4) is not easy to interpret; maybe it is much stronger than we can imagine, but it makes life much easier.

In Bayesian inference we have two random variables: θ and X, and both play
different roles. The random variable θ is not observed, but the parameter
generating distribution π is known. The random variable X is observed.
The data generating distribution is the conditional distribution of X given
θ and it is known up to θ.

Summarizing we have the following definition:

Definition 2.3 (Bayes model) A Bayes model {P, π} consists of a set


of conditional probability distributions over X , P = {Pθ : θ ∈ Θ}, where
PX|θ = Pθ and one distribution π over Θ.
The set Θ has finite dimension and is called the parameter space. The
distribution π over Θ is called the prior distribution.
Furthermore, we impose a strong condition that the Bayes model is the right
one, i.e.,

The random variable (θ, X) has a distribution P(θ,X) where


Pθ = π and PX|θ ∈ P.

The choice of the prior distribution is a very important part of Bayesian mod-
elling. Different approaches and principles for choosing a prior are presented
in Chapter 3.
The following historical example of Thomas Bayes is used to illustrate the
construction of the parameter generating experiment and the data generating
experiment; see the details in Stigler (1982).

Example 2.9 (Bayes’ billiard table)


Consider a flat square table with length 1. It has no pockets. Bayes never
specified it as a billiard table, but under this name the example is now well
known in literature. The first ball W is rolled across the table and stopped
at an arbitrary place, uniformly distributed over the unit square. The table
is divided vertically through the position of the ball W. The position of W is
not saved. A second ball O is rolled across the table in the same manner n
times. It is counted how often the second ball O comes to rest to the left of
W. Translating this description we have that the horizontal position of W is
the underlying parameter θ with prior distribution U(0, 1). The data x is the
number of successes, where the probability of success equals θ. Thus
θ ∼ U(0, 1), X|θ ∼ Bin(n, θ).
2
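The two-stage experiment is also easy to mimic in R. The following lines are an added sketch; n = 10 and the seed are arbitrary choices:

# Simulate Bayes' billiard table: first generate the parameter, then the data.
set.seed(1)
n     <- 10
theta <- runif(1)                            # position of the first ball W (the parameter)
x     <- rbinom(1, size = n, prob = theta)   # number of balls O landing to the left of W
c(theta = theta, x = x)
# The parameter is drawn once; given theta, the data could be generated repeatedly.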
Note, the parameter is generated only one time, whereas the data can be
generated repeatedly many times. The notion of an i.i.d. experiment X =
(X1 , . . . , Xn ) in a Bayes set up is related to the conditional distribution of X
given θ. The r.v. (θ, X1 , . . . , Xn ) is not an i.i.d. sample. Especially the r.vs.
Xi1 , Xi2 are not independent.
We demonstrate it by the following example.

Example 2.10 (Normal i.i.d. sample)


Consider an i.i.d. sample from N(θ, σ 2 ) with σ 2 known. The parameter of
interest is θ with normal prior
θ ∼ N(μ0 , σ02 ),
where μ0 and σ02 are known. Further, set the sample size n = 2. Then the
random vector (θ, X1 , X2 ) is three dimensional normally distributed with
$$\mathrm{E}X_i = \mathrm{E}_\theta(\mathrm{E}(X_i|\theta)) = \mathrm{E}\theta = \mu_0,$$

Figure 2.2: The billiard table for Example 2.9.

$$\mathrm{Var}X_i = \mathrm{E}_\theta(\mathrm{Var}(X_i|\theta)) + \mathrm{Var}_\theta(\mathrm{E}(X_i|\theta)) = \sigma^2 + \mathrm{Var}\,\theta = \sigma^2 + \sigma_0^2.$$


Using the relation

Cov(U, Z) = EY (Cov((U, Z)|Y )) + CovY (E(U |Y ), E(Z|Y )), (2.5)

we have

Cov(X1 , X2 ) = Eθ (Cov((X1 , X2 )|θ)) + Covθ (E(X1 |θ), E(X2 |θ)).

As the data are conditionally i.i.d., it implies that Cov((X1 , X2 )|θ) = 0, so


that
Cov(X1 , X2 ) = Covθ (E(X1 |θ), E(X2 |θ)) = Var θ = σ02 .
Further,

$$\mathrm{Cov}(X_1, \theta) = \mathrm{E}(X_1\theta) - \mu_0^2 = \mathrm{E}_\theta(\mathrm{E}(X_1\theta|\theta)) - \mu_0^2 = \mathrm{Var}\,\theta = \sigma_0^2.$$

Summarizing, we obtain
$$\begin{pmatrix} \theta \\ X_1 \\ X_2 \end{pmatrix} \sim N\left(\begin{pmatrix} \mu_0 \\ \mu_0 \\ \mu_0 \end{pmatrix},\ \begin{pmatrix} \sigma_0^2 & \sigma_0^2 & \sigma_0^2 \\ \sigma_0^2 & \sigma_0^2 + \sigma^2 & \sigma_0^2 \\ \sigma_0^2 & \sigma_0^2 & \sigma_0^2 + \sigma^2 \end{pmatrix}\right),$$

Figure 2.3: From prior to posterior via experiment.

which is not the distribution of an i.i.d. sample. But we still have
$$\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \Big|\, \theta \sim N\left(\begin{pmatrix} \theta \\ \theta \end{pmatrix},\ \begin{pmatrix} \sigma^2 & 0 \\ 0 & \sigma^2 \end{pmatrix}\right).$$
2

In fact, we are not interested in the joint distribution of (θ, X); our interest is
to learn about the underlying data generating distribution PX|θ , the main tool
for which is the conditional distribution of θ given X = x. The information
on the parameter based on prior and on the experiment is included in the
conditional distribution of θ given x, denoted by π(.|x) and called posterior
distribution. Roughly speaking the posterior distribution takes over the role
of the likelihood. We formulate the Bayesian inference principle as follows.

“The information on the underlying parameter θ is entirely


contained in the posterior distribution.
All statistical conclusions are based on π(θ|x).”

In other words, when we know π(.|x), we can do all inference, including esti-
mation and testing. The derivation of the posterior distribution is the first step
in a Bayes study, and the main tool to arrive at this is the Bayes Theorem.
It holds that
P(θ,X) = Pθ PX|θ = PX Pθ|X .
Thus
$$P^{\theta|X} = \frac{P^{\theta}\, P^{X|\theta}}{P^{X}}.$$
Assuming that the data and the parameter have continuous distributions, we
rewrite this relation using the respective densities:

$$\pi(\theta|x) = \frac{\pi(\theta) f(x|\theta)}{f(x)},$$

where f(x|θ) is the likelihood function ℓ(θ|x). The joint density of (θ, X) can be written as
$$f(\theta, x) = f(x|\theta)\pi(\theta),$$
where f(x) is the density of the marginal distribution, i.e.,
$$f(x) = \int_\Theta f(x|\theta)\pi(\theta)\,d\theta. \qquad (2.6)$$

Hence for ℓ(θ|x) = f(x|θ), we have
$$\pi(\theta|x) = \frac{\pi(\theta)\,\ell(\theta|x)}{\int_\Theta \ell(\theta|x)\pi(\theta)\,d\theta} \propto \pi(\theta)\,\ell(\theta|x).$$

In case of a discrete distribution π over Θ, we have
$$f(x) = \sum_{\theta \in \Theta} f(x|\theta)\pi(\theta)$$
and
$$\pi(\theta|x) = \frac{\pi(\theta)\,\ell(\theta|x)}{\sum_{\theta \in \Theta} \ell(\theta|x)\pi(\theta)} \propto \pi(\theta)\,\ell(\theta|x),$$
where we use ℓ(θ|x) = Pθ({x}). In each case the most important relation for determining the posterior distribution is:

$$\pi(\theta|x) \propto \pi(\theta)\,\ell(\theta|x).$$

The product π(θ)ℓ(θ|x) is the kernel function of the posterior. It includes the prior information (π) and the information from the data (ℓ(·|x)). The posterior
distribution is then determined up to a constant. To determine the complete
posterior, we can apply different methods.
• Compare the kernel function with known distribution families,
• Calculate the normalizing constant analytically,
• Calculate the normalizing constant with Monte Carlo methods,
• Generate a sequence θ(1) , . . . , θ(N ) which is distributed as π(θ|x), and
• Generate a sequence θ(1) , . . . , θ(N ) which is distributed approximately as
π(θ|x).
The computer intensive methods on Bayesian computations are included in
Chapter 9. Hence we illustrate the first three items.

Example 2.11 (Binomial distribution and beta prior)


The Bayes model related to the historical billiard table Example 2.9 is given
by θ ∼ U(0, 1), X|θ ∼ Bin(n, θ), with the likelihood function
$$\ell(\theta|x) = \binom{n}{x}\theta^{x}(1-\theta)^{n-x} \propto \theta^{x}(1-\theta)^{n-x}.$$

The density of U(0, 1) is constant and equals one. Hence the posterior distri-
bution is determined by

$$\pi(\theta|x) \propto \theta^{x}(1-\theta)^{n-x}.$$

This is the kernel of a beta distribution Beta(1 + x, 1 + n − x). The beta


distribution is continuous, defined over the interval [0, 1], and depends on
positive parameters α and β. Its density is given by

$$f(x|\alpha, \beta) = B(\alpha, \beta)^{-1} x^{\alpha-1}(1-x)^{\beta-1}, \qquad (2.7)$$

where the normalizing constant B(α, β) is the beta function. Note that, U(0, 1)
is the beta distribution with α = 1, β = 1. In a more general case where the
prior is a beta distribution Beta(α0 , β0 ), we have

$$\pi(\theta|x) \propto \theta^{\alpha_0 - 1}(1-\theta)^{\beta_0 - 1}\,\theta^{x}(1-\theta)^{n-x} \propto \theta^{\alpha_0 + x - 1}(1-\theta)^{\beta_0 + n - x - 1}$$

and
θ|x ∼ Beta(α0 + x, β0 + n − x).
2
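As a numerical check of Example 2.11 (an added R sketch; it uses x = 6, n = 10 and the Beta(4, 2) prior shown in Figure 2.4), the exact conjugate posterior can be compared with a grid normalization of the kernel π(θ)ℓ(θ|x):

# Beta prior + binomial likelihood: conjugate posterior vs. grid normalization.
x <- 6; n <- 10
a0 <- 4; b0 <- 2                                        # prior Beta(a0, b0)
theta  <- seq(0.001, 0.999, length.out = 1000)
kernel <- dbeta(theta, a0, b0) * dbinom(x, n, theta)    # pi(theta) * l(theta|x)
post_grid  <- kernel / sum(kernel * diff(theta)[1])     # normalize numerically
post_exact <- dbeta(theta, a0 + x, b0 + n - x)          # Beta(10, 6)
max(abs(post_grid - post_exact))                        # small numerical error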

Example 2.12 (Normal i.i.d. sample and normal prior)


Consider an i.i.d. sample X = (X1 , . . . , Xn ) from N(μ, σ 2 ) with known vari-
ance σ 2 . The unknown parameter is θ = μ. We assume that μ is a realization
of a normal distribution N(μ0, σ0²). We have
$$\ell(\mu|x) \propto \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2\right)$$
and
$$\pi(\mu) \propto \exp\left(-\frac{1}{2\sigma_0^2}(\mu - \mu_0)^2\right).$$
Figure 2.4: Example 2.11. Left: Prior distributions Beta(1, 1) and Beta(4, 2). Right: Posterior distributions Beta(7, 5) and Beta(10, 6), after observing x = 6 from Bin(10, θ).

Thus
$$\pi(\mu|x) \propto \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2 - \frac{1}{2\sigma_0^2}(\mu - \mu_0)^2\right).$$
Using the identity $\sum_{i=1}^{n}(x_i - \mu)^2 = (n-1)s^2 + n(\mu - \bar{x})^2$, where s² is the sample variance
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2,$$
we obtain
$$\pi(\mu|x) \propto \exp\left(-\frac{n}{2\sigma^2}(\mu - \bar{x})^2 - \frac{1}{2\sigma_0^2}(\mu - \mu_0)^2\right).$$
By completing the squares, we get
$$\pi(\mu|x) \propto \exp\left(-\frac{n\sigma_0^2 + \sigma^2}{2\sigma_0^2\sigma^2}\left(\mu - \frac{\bar{x}n\sigma_0^2 + \mu_0\sigma^2}{n\sigma_0^2 + \sigma^2}\right)^2\right).$$

This is the kernel function of a normal distribution, so that the posterior distribution is
$$N(\mu_1, \sigma_1^2), \quad \text{with} \quad \mu_1 = \frac{\bar{x}n\sigma_0^2 + \mu_0\sigma^2}{n\sigma_0^2 + \sigma^2} \quad \text{and} \quad \sigma_1^2 = \frac{\sigma_0^2\sigma^2}{n\sigma_0^2 + \sigma^2}. \qquad (2.8)$$
We see that the expectation is a weighted average of the prior mean and
the sample mean. The variance of the posterior distribution σ12 vanishes as n
approaches infinity. The prior and posterior distributions are given in Figure
2.5, where the posterior distribution is a compromise between the information
from the experiment and the prior. 2

Figure 2.5: Example 2.12. Illustration of the posterior distribution as a compromise between prior information and experimental information.
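Formula (2.8) is straightforward to evaluate. The following R sketch is an added illustration; the data are simulated and the hyperparameters μ0 = 2, σ0² = 0.25 are arbitrary choices:

# Posterior N(mu1, sigma1^2) for a normal sample with known variance, formula (2.8).
set.seed(2)
sigma2  <- 1                            # known data variance
mu0     <- 2; sigma02 <- 0.25           # prior N(mu0, sigma02)
x       <- rnorm(20, mean = 1, sd = sqrt(sigma2))
n       <- length(x); xbar <- mean(x)
mu1     <- (xbar * n * sigma02 + mu0 * sigma2) / (n * sigma02 + sigma2)
sigma12 <- (sigma02 * sigma2) / (n * sigma02 + sigma2)
c(posterior_mean = mu1, posterior_var = sigma12)
# The posterior mean is a weighted average of the prior mean and the sample mean.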

Let us consider a similar situation as in Example 2.12, but now focusing on


an inference about the precision parameter τ = 1/σ².

Example 2.13 (Normal i.i.d. sample and gamma prior)


Consider an i.i.d. sample X = (X1 , . . . , Xn ) from N(0, σ 2 ) with unknown
variance σ 2 . The parameter of interest is the precision parameter θ = τ =
σ −2 . As prior distribution of τ , we take a gamma distribution denoted by
Gamma(α, β), with the density

$$f(\tau|\alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, \tau^{\alpha-1} \exp(-\beta\tau), \qquad (2.9)$$

where α > 0 is the shape parameter and β > 0 is the rate parameter. In this case, we have
$$\ell(\tau|x) \propto \tau^{\frac{n}{2}} \exp\left(-\frac{\tau}{2}\sum_{i=1}^{n} x_i^2\right)$$
and
$$\pi(\tau|\alpha, \beta) \propto \tau^{\alpha-1} \exp(-\beta\tau).$$
Then the posterior distribution has the kernel
$$\pi(\tau|x, \alpha, \beta) \propto \tau^{\alpha-1+\frac{n}{2}} \exp\left(-\tau\left(\frac{1}{2}\sum_{i=1}^{n} x_i^2 + \beta\right)\right).$$
Hence the posterior distribution is also a gamma distribution, with shape parameter α + n/2 and rate parameter (1/2)Σᵢ xᵢ² + β:
$$\tau|x \sim \mathrm{Gamma}\left(\alpha + \frac{n}{2},\ \frac{1}{2}\sum_{i=1}^{n} x_i^2 + \beta\right). \qquad (2.10)$$

Let us consider the same set up once more, where now the parameter of interest
is the variance.

Example 2.14 (Normal i.i.d. sample and inverse-gamma prior)


We consider the same sample as in Example 2.13. The unknown parameter
is the variance θ = σ 2 . Let the prior distribution of θ be an inverse-gamma
distribution InvGamma(α, β) with shape parameter α and scale parameter β,
which has the density

$$f(\theta|\alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} \left(\frac{1}{\theta}\right)^{\alpha+1} \exp\left(-\frac{\beta}{\theta}\right). \qquad (2.11)$$

Note that, if X ∼ Gamma(α, β), then X⁻¹ ∼ InvGamma(α, β). We have
$$\ell(\theta|x) \propto \left(\frac{1}{\theta}\right)^{\frac{n}{2}} \exp\left(-\frac{1}{2\theta}\sum_{i=1}^{n} x_i^2\right)$$
and
$$\pi(\theta|\alpha, \beta) \propto \left(\frac{1}{\theta}\right)^{\alpha+1} \exp\left(-\frac{\beta}{\theta}\right).$$
Then the posterior distribution has the kernel
$$\pi(\theta|x, \alpha, \beta) \propto \left(\frac{1}{\theta}\right)^{\alpha+1+\frac{n}{2}} \exp\left(-\frac{\beta + \frac{1}{2}\sum_{i=1}^{n} x_i^2}{\theta}\right).$$
Hence the posterior distribution is the inverse-gamma distribution with shape parameter α + n/2 and scale parameter (1/2)Σᵢ xᵢ² + β (see Figure 2.6), i.e.,
$$\sigma^2|x \sim \mathrm{InvGamma}\left(\alpha + \frac{n}{2},\ \frac{1}{2}\sum_{i=1}^{n} x_i^2 + \beta\right). \qquad (2.12)$$

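For Examples 2.13 and 2.14 the posterior parameters are simple functions of the data. The following R sketch is an added illustration; the prior values α = β = 2 and the simulated sample are arbitrary choices:

# Posterior for the precision tau (gamma) and for the variance sigma^2 (inverse gamma).
set.seed(3)
x     <- rnorm(10, mean = 0, sd = 2)   # sample from N(0, sigma^2) with sigma^2 = 4
n     <- length(x)
alpha <- 2; beta <- 2                  # prior Gamma(alpha, beta) for tau = 1/sigma^2
alpha_post <- alpha + n / 2
beta_post  <- beta + sum(x^2) / 2
# tau | x ~ Gamma(alpha_post, rate = beta_post);  sigma^2 | x ~ InvGamma(alpha_post, beta_post)
c(shape = alpha_post, rate = beta_post,
  post_mean_tau = alpha_post / beta_post,          # E(tau | x)
  post_mean_var = beta_post / (alpha_post - 1))    # E(sigma^2 | x), valid for alpha_post > 1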
We revisit the lion example to illustrate a case where the constant can be
easily calculated.

Figure 2.6: Example 2.14. The posterior distribution is more concentrated and slightly shifted towards the true parameter.

Example 2.15 (Lion’s appetite)


For x = 3 the likelihood function is
             θ1     θ2     θ3
ℓ(θ|x = 3)   0.8    0.1    0

Experience says that an adult male lion is lethargic with probability 0.8, whereas the probability that he is hungry is 0.1:

         θ1     θ2     θ3
π(θ)     0.1    0.1    0.8

Thus

π(θ1|x = 3) ∝ (0.8)(0.1),   π(θ2|x = 3) ∝ (0.1)(0.1),   π(θ3|x = 3) ∝ (0)(0.8).

The normalizing constant is 0.09, and we obtain for the posterior distribution

              θ1      θ2      θ3
π(θ|x = 3)    0.889   0.111   0

After knowing that the lion has eaten 3 persons, the probability that he was
hungry is really high; no surprise! 2
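The same discrete calculation takes three lines in R (an added sketch using the prior and likelihood tables above):

# Discrete posterior for the lion example: prior times likelihood, then normalize.
prior      <- c(theta1 = 0.1, theta2 = 0.1, theta3 = 0.8)
likelihood <- c(theta1 = 0.8, theta2 = 0.1, theta3 = 0.0)   # l(theta | x = 3)
kernel     <- prior * likelihood
posterior  <- kernel / sum(kernel)    # normalizing constant is sum(kernel) = 0.09
round(posterior, 3)                   # 0.889, 0.111, 0.000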

The following example is chosen to illustrate the need of computer intensive


methods to calculate the posterior; see Figure 2.7.

Figure 2.7: Example 2.16. The posterior distribution is normalized by an approximation of m(x).

Example 2.16 (Cauchy i.i.d. sample and normal prior)


We consider an i.i.d. sample X = (X1 , . . . , Xn ) from a Cauchy distribution
C(θ, γ) with location parameter θ and scale parameter γ > 0. The density of
C(θ, γ) is
$$f(x|\theta, \gamma) = \frac{1}{\pi\gamma\left(1 + \left(\frac{x - \theta}{\gamma}\right)^2\right)}. \qquad (2.13)$$

The parameter of interest is θ. We set γ = 1. Thus the likelihood is given by



$$\ell(\theta|x) \propto \prod_{i=1}^{n} \frac{1}{1 + (x_i - \theta)^2}.$$

As prior distribution of θ we take the normal distribution N(μ, σ 2 ), where μ


and σ² are given. Using π(θ|x) ∝ ℓ(θ|x)π(θ), we obtain
$$\pi(\theta|x) \propto \prod_{i=1}^{n} \frac{1}{1 + (x_i - \theta)^2}\, \exp\left(-\frac{(\theta - \mu)^2}{2\sigma^2}\right). \qquad (2.14)$$

The normalizing constant
$$m(x) = \int_{-\infty}^{\infty} \prod_{i=1}^{n} \frac{1}{1 + (x_i - \theta)^2}\, \exp\left(-\frac{(\theta - \mu)^2}{2\sigma^2}\right) d\theta$$

cannot be integrated analytically. Here in the introductory example we apply the independent Monte Carlo method, where m(x) is taken as the expected value of
$$g(\theta) = \sqrt{2\pi}\,\sigma \prod_{i=1}^{n} \frac{1}{1 + (x_i - \theta)^2}$$

with respect to θ ∼ N(μ, σ 2 ). A random number generator is used to generate


independently a sequence θ(1) , . . . , θ(N ) from N(μ, σ 2 ). Then the integral is
approximated by
$$\hat{m}(x) = \frac{1}{N}\sum_{j=1}^{N} g(\theta^{(j)})$$

and the posterior is approximated by
$$\hat{\pi}(\theta|x) = \frac{1}{\hat{m}(x)} \prod_{i=1}^{n} \frac{1}{1 + (x_i - \theta)^2}\, \exp\left(-\frac{(\theta - \mu)^2}{2\sigma^2}\right).$$
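The independent Monte Carlo approximation of m(x) described above can be coded directly. The following R sketch is an added illustration; the simulated data, the prior values μ = 0, σ = 2 and N = 10000 are arbitrary choices:

# Independent Monte Carlo approximation of the normalizing constant m(x)
# for the Cauchy likelihood with a normal prior (Example 2.16).
set.seed(4)
x  <- rcauchy(5, location = 2)        # observed sample (gamma = 1)
mu <- 0; sigma <- 2                   # prior N(mu, sigma^2)
g  <- function(theta) sqrt(2 * pi) * sigma *
        sapply(theta, function(t) prod(1 / (1 + (x - t)^2)))
theta_sim <- rnorm(10000, mean = mu, sd = sigma)   # theta^(1), ..., theta^(N)
m_hat <- mean(g(theta_sim))                        # approximates m(x)
post  <- function(theta) g(theta) * dnorm(theta, mu, sigma) / m_hat
# post() is the approximated posterior density; e.g. evaluate it on a grid:
post(seq(-1, 5, by = 1))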

2.3 Advantages
In this section we present three settings, where the Bayesian approach gives
easy computations in many useful models.

2.3.1 Sequential Analysis


Suppose we have successive data sets xt1 , . . . , xtn related to the same param-
eter:
xtj , realization of Xtj ∼ P Xtj ∈ {Pj,θ ; θ ∈ Θ} .
Further, we assume a common prior π:

θ ∼ π.

The posterior distribution based on the first data set is calculated by

π(θ|xt1) ∝ π(θ) ℓ1(θ|xt1),

where ℓ1(θ|xt1) is the likelihood function corresponding to P1,θ. Knowing the results of the first experiment xt1 we take the posterior distribution π(θ|xt1) as new prior and calculate

π(θ|xt1, xt2) ∝ π(θ|xt1) ℓ2(θ|xt2),

where ℓ2(θ|xt2) is the likelihood function corresponding to P2,θ. Continuing similarly for each new independent data set, we finally obtain the posterior distribution based on all data sets

π(θ|xt1, . . . , xtn) ∝ π(θ) ℓ(θ|xt1, . . . , xtn),

where ℓ(θ|xt1, . . . , xtn) is the likelihood corresponding to P1,θ × . . . × Pn,θ. This
procedure can also be generalized for cases with dependent data sets, using
the likelihood functions of the conditional distributions Xtj |(Xt1 , . . . , Xt(j−1) ).

Example 2.17 (Sequential data)


Consider i.i.d. data sets (x1 , x2 , x3 ), (y1 , y2 ), (z1 , z2 , z3 , z4 ), independent of
each other, where each single observation is a realization of N(θ, σ 2 ), σ 2 known.
Suppose the prior θ ∼ N(μ0 , σ02 ). Then the posterior distribution after the first
data set is N(μ1, σ1²) with
$$\mu_1 = \frac{\sum_{i=1}^{3} x_i\,\sigma_0^2 + \mu_0\sigma^2}{3\sigma_0^2 + \sigma^2} \quad \text{and} \quad \sigma_1^2 = \frac{\sigma_0^2\sigma^2}{3\sigma_0^2 + \sigma^2}.$$
The posterior distribution after the first two data sets is N(μ2, σ2²) with
$$\mu_2 = \frac{(y_1 + y_2)\sigma_1^2 + \mu_1\sigma^2}{2\sigma_1^2 + \sigma^2} \quad \text{and} \quad \sigma_2^2 = \frac{\sigma_1^2\sigma^2}{2\sigma_1^2 + \sigma^2}.$$
Finally, the posterior distribution after all three data sets is N(μ3, σ3²) with
$$\mu_3 = \frac{\sum_{i=1}^{4} z_i\,\sigma_2^2 + \mu_2\sigma^2}{4\sigma_2^2 + \sigma^2} \quad \text{and} \quad \sigma_3^2 = \frac{\sigma_2^2\sigma^2}{4\sigma_2^2 + \sigma^2}.$$

See Figure 2.8. 2
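The sequential updates can be implemented as a single function that is applied data set by data set, each time feeding the current posterior in as the new prior. The following R sketch is an added illustration; σ² = 1, the N(0, 4) prior and the simulated data sets are arbitrary choices:

# Sequential Bayesian updating for a normal mean with known variance:
# the posterior after one data set becomes the prior for the next.
update_normal <- function(x, mu_prior, var_prior, sigma2) {
  n <- length(x)
  var_post <- var_prior * sigma2 / (n * var_prior + sigma2)
  mu_post  <- (sum(x) * var_prior + mu_prior * sigma2) / (n * var_prior + sigma2)
  list(mean = mu_post, var = var_post)
}
set.seed(5)
sigma2 <- 1
post   <- list(mean = 0, var = 4)                           # prior N(0, 4)
data_sets <- list(rnorm(3, 2), rnorm(2, 2), rnorm(4, 2))    # true theta = 2
for (d in data_sets) post <- update_normal(d, post$mean, post$var, sigma2)
unlist(post)

Because the updates are conjugate, the final result is identical to processing all nine observations in one batch.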

2.3.2 Big Data


Another set up, where the Bayesian approach is really helpful, is the case where
a huge amount of data is observed. The underlying data generating procedure
is complicated, where the dimension p of the parameter space Θ exceeds the
dimension n of the sample space. Using additional prior information makes
the situation manageable. We demonstrate it by the following example.

Example 2.18 (Big Data)


Consider the linear model,
$$Y = X\beta + \varepsilon, \qquad (2.15)$$
where the n × 1 vector Y is observed and the n × p matrix X is known. The unknown regression parameter is the p × 1 vector β. The n × 1 error vector ε is unobservable. We assume
$$\varepsilon \sim N_n(0, \sigma^2 I_n).$$

Further we suppose that the variance σ 2 is known. The parameter of interest


is θ = β ∈ Rp . Under the big data set up n < p, the n × p matrix X does

Figure 2.8: Example 2.17. The iteratively calculated posterior distributions are more and more concentrated around the unknown underlying parameter.

not have full rank so that the inverse of XT X does not exist. The likelihood
function
$$\ell(\theta|Y) \propto \left(\frac{1}{\sigma^2}\right)^{\frac{n}{2}} \exp\left(-\frac{1}{2\sigma^2}(Y - X\beta)^T(Y - X\beta)\right), \qquad (2.16)$$
has no unique maximum. The maximum likelihood estimators β̂ are the solutions of
$$X^T X\beta = X^T Y.$$
Using the Bayesian approach we assume a normal prior

β ∼ Np (β0 , σ 2 Σ0 ),

where Σ0 is positive definite, and obtain for the posterior
$$\pi(\beta|Y) \propto \exp\left(-\frac{1}{2\sigma^2}\left[(Y - X\beta)^T(Y - X\beta) + (\beta - \beta_0)^T\Sigma_0^{-1}(\beta - \beta_0)\right]\right).$$
Completing squares we get
$$(Y - X\beta)^T(Y - X\beta) + (\beta - \beta_0)^T\Sigma_0^{-1}(\beta - \beta_0)$$
$$= (\beta - \beta_1)^T\Sigma_1^{-1}(\beta - \beta_1) + (Y - X\beta_1)^T(Y - X\beta_1) + (\beta_0 - \beta_1)^T\Sigma_0^{-1}(\beta_0 - \beta_1)$$
with
$$\beta_1 = \left(X^T X + \Sigma_0^{-1}\right)^{-1}\left(X^T Y + \Sigma_0^{-1}\beta_0\right)$$
and
$$\Sigma_1^{-1} = X^T X + \Sigma_0^{-1}.$$
Hence the posterior distribution is
$$\pi(\beta|Y) \propto \exp\left(-\frac{1}{2\sigma^2}(\beta - \beta_1)^T\Sigma_1^{-1}(\beta - \beta_1)\right),$$
i.e.,
$$\beta|Y \sim N_p(\beta_1, \sigma^2\Sigma_1).$$
Exploring the posterior instead of the likelihood function we avoid the com-
plications due to n < p, since in this case the inverse of XT X does not exist,
but the inverse of Σ1 does. 2
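The posterior mean β1 and the matrix Σ1⁻¹ in Example 2.18 can be computed directly even when n < p. The following R sketch is an added illustration; the dimensions n = 20, p = 50 and the prior N_p(0, σ²I_p) are arbitrary choices:

# Posterior N_p(beta1, sigma^2 * Sigma1) for a normal linear model with n < p.
set.seed(6)
n <- 20; p <- 50
X      <- matrix(rnorm(n * p), n, p)
beta   <- c(rep(2, 5), rep(0, p - 5))          # "true" coefficients used for simulation
sigma2 <- 1
Y      <- X %*% beta + rnorm(n, sd = sqrt(sigma2))
beta0  <- rep(0, p); Sigma0 <- diag(p)         # prior N_p(beta0, sigma^2 * Sigma0)
Sigma1_inv <- crossprod(X) + solve(Sigma0)     # X'X + Sigma0^{-1}, positive definite
beta1      <- solve(Sigma1_inv, crossprod(X, Y) + solve(Sigma0) %*% beta0)
head(beta1)                                    # posterior mean of beta
# X'X alone has rank at most n = 20, but Sigma1_inv has full rank p = 50.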

2.3.3 Hierarchical Models


Hierarchical models are based on stepwise conditioning. Consider non–
identically distributed data sets x1 , . . . , xm , where each data set xj , j =
1, . . . , m is the realization of a variable Xj with Xj ∼ Pθj . We assume

θ1 , . . . , θm are i.i.d. p(.|α) distributed.

In this case the distribution p(.|α) is a member of a parametric family, with


parameter α. The dimension of α is less than the dimension of θ = (θ1 , . . . , θm ).
It is a question of taste to consider p(.|α) as a prior distribution or as part
of a hierarchical modelling. Furthermore the hyperparameter α can also be
modelled as a realization of a prior

α ∼ πμ 0 ,

where μ0 is known. Setting x = (x1 , . . . , xm ), the posterior distribution can


be calculated as marginal distribution of

$$\pi(\theta, \alpha|x) \propto p(\theta|\alpha)\,\pi_{\mu_0}(\alpha)\,\ell(\theta|x),$$
which gives
$$\pi(\theta|x) \propto \pi(\theta)\,\ell(\theta|x),$$
where
$$\pi(\theta) = \int \pi(\theta|\alpha)\,\pi_{\mu_0}(\alpha)\,d\alpha.$$

The next example, taken from Dupuis (1995), illustrates a hierarchical model.

Example 2.19 (Capture–recapture)


In Dupuis (1995), a multiple capture–recapture experiment is analyzed by
using the Arnason-Schwarz model. In Cévennes, France, on the Mont Lozère,

Figure 2.9: Capture–recapture for Example 2.19.

the migration behaviour of lizards, Lacerta vivipara, is studied. The lizard i,


i = 1, . . . , n is captured (recaptured) and individually marked and returned
in the stratum r = 1, . . . , k at times j = 1, . . . , 6 (twice per year 1989, 1990,
1991). The number of marked juveniles is 96. The data for each animal i
consists of yi = (xi , zi ) with xi = (x(i,1) , . . . , x(i,6) ) and zi = (z(i,1) , . . . , z(i,6) ),
where

1 if lizard i is captured at time j
x(i,j) =
0 otherwise

r if lizard i is captured in stratum r at time j
z(i,j) =
0 otherwise

Further it is assumed that yi = (xi , zi ), i = 1, . . . , n, are i.i.d. The parameters


are
pj (r) = probability that a lizard will be captured in stratum r at time j.
qj (r, s) = transition probability for moving from r to s at time j.

It is assumed that
qj (r, s) = φj (r)ψj (r, s)
where φj (r) is the survival probability and ψj (r, s) is the probability of moving
from r to s. The parameter of interest is given by

θ = (p, φ, ψ),
LIST OF PROBLEMS 27
where p = (p1 (1), . . . , p6 (k)), φ = (φ1 (1), . . . , φ6 (k)) and ψ =
(ψ1(1, 1), . . . , ψ6(k, k)). The following assumptions are set:

pj (r) ∼ Beta(a, b), i.i.d.

φj (r) ∼ Beta(α, β), i.i.d.


Set ψj (r) = (ψj (r, 1), . . . , ψj (r, k)) for the probabilities of moving from the
location r at time j. They are assumed to follow Dirichlet distribution,

ψj (r) ∼ Dirk (e1 , . . . , ek ), i.i.d.

The Dirichlet distribution is a generalization of the beta distribution. In


general, the probability function of X ∼ Dirk(p1, . . . , pk) is given for x = (x1, . . . , xk) with $\sum_{i=1}^{k} x_i = 1$. It depends on k positive parameters p1, . . . , pk. Setting $\sum_{i=1}^{k} p_i = p_0$,
$$p(x|p_1, \ldots, p_k) = \frac{\Gamma(p_0)}{\Gamma(p_1)\cdots\Gamma(p_k)}\, x_1^{p_1-1} \cdots x_k^{p_k-1}.$$

The components of ψj (r) are dependent. The other parameters are assumed to
be independent. The hyperparameters e1 , . . . , ek are independent of the time
j and the stratum s. In the study the experimenters set all hyperparameters
as known, using the knowledge on the behavior of the lizards. The Bayesian
inference is done by using Gibbs sampling, see Subsection 9.4.2. 2

2.4 List of Problems


1. Consider the following statistical model to analyze student’s results in a
written exam. Each student’s result can be A, B, C, D, where A is the best
grade and D means fail. The parameter θ is the level of preparation 0, 1, 2,
where 0 means nothing was done for the exam, 1 means that the student
invested some time and the highest level 2 means the student was very well
prepared. The probability Pθ (x) is given in the following table:

x D C B A
θ=0 0.8 0.1 0.1 0
θ=1 0.2 0.5 0.2 0.1
θ=2 0 0.1 0.5 0.4

The probability that a student has done nothing is 0.1 and that the student
is very well prepared is 0.3. Calculate the posterior distribution for each x.
2. Assume X|θ ∼ Pθ and θ ∼ π. Show that:
The statistic T = T (X) is sufficient iff the posteriors of T and X coincide:
π(θ|X) = π(θ|T (X)).
3. Assume X|n ∼ Bin(n, 0.5). Find a prior on n such that n|X ∼ NB(x, 0.5).
4. Consider a sample X = (X1 , . . . , Xn ) from N(0, σ 2 ), n = 4. The unknown
parameter is the precision parameter θ = τ = σ −2 . As prior distribution of
τ , we set Gamma(2, 2). Derive the posterior distribution.
5. In summer 2020, in a small town, 1000 inhabitants were randomly chosen
and their blood samples are tested for Corona antibodies. 15 persons got
a positive test result. From another study in Spring it was noted that the
proportion of the population with antibodies is around 2%. Let X be the
number of inhabitants tested positive in the study. Let θ be the probability
of a positive test. Two different Bayes models are proposed.
• M0 : X ∼ Bin(1000, p), p ∼ Beta(1, 20)
• M1 : X ∼ Poi(λ), λ ∼ Gamma(20, 1)
(a) Derive the posterior distributions for both models.
(b) Discuss the differences between the two models.
(c) Give a recommendation.
6. Consider a multiple regression model

yi = β0 + xi β1 + zi β2 + εi , i = 1, . . . , n

where εi are i.i.d. normally distributed with expectation zero and variance σ² = 0.25. Further, $\sum_{i=1}^{n} x_i = 0$, $\sum_{i=1}^{n} z_i = 0$, $\sum_{i=1}^{n} x_i z_i = 0$, $\sum_{i=1}^{n} x_i^2 = n$, and $\sum_{i=1}^{n} z_i^2 = n$. The unknown three-dimensional parameter β = (β0, β1, β2)ᵀ is normally distributed with mean μ = (1, 1, 1)ᵀ and covariance matrix Σ = I3. Determine the posterior distribution of β.
7. Hierarchical model. The observations belong to independent random vari-
ables X = (Xij ), where

Xij ∼ N(θi , 1), i = 1, . . . , n, j = 1, . . . , k.

The parameters θ1 , . . . , θn are independent and normally distributed with

θi |μ ∼ N(μ, 1), μ ∼ N(0, 1).

(a) Determine the prior for θ = (θ1 , . . . , θn )T .


(b) Calculate the posterior distribution of θ given x.
(c) Calculate the posterior distribution of $\bar{\theta} = \frac{1}{n}\sum_{i=1}^{n} \theta_i$ given x.
Chapter 3

Choice of Prior

The main difference between a statistical model and a Bayes model lies in the
assumption that the parameter of the data generating distribution is the re-
alization of a random variable from a known distribution. This distribution is
called prior distribution. The parameter is not observed but its prior distribu-
tion is assumed to be known. The choice of the prior distribution is essential.
In this chapter we present different principles for determining the prior.

Principles of Modelling
We begin by summarizing a general set of principles of statistical modelling.
These principles are valid for a statistical model P = {Pθ : θ ∈ Θ} following
Definition 2.1 where we aim to determine a family of distributions. Concerning
Bayes model {P, π}, given in Definition 2.3, we need to model both a family
of distributions and a single prior distribution. Thus the principles become
even more important.
• First, all statistical models need assumptions for their valid applications.
It is, therefore, worth keeping in mind, while developing the mathematical
properties of models, that these properties will only hold as long as the
assumptions do.
• Second, the development or selection of a statistical model should be
objective-oriented. A model which is good for prediction, for example, may
not be appropriate to explore certain relationships between variables.
• Third, since models are generally only approximative, it is highly recom-
mended to be pragmatic rather than perfectionist. One should thus take
the risk of setting necessary assumptions and, more so, keep track of them
while analyzing the data under the postulated model.
• Fourth, it makes much sense to fit different models to the same problem
and compare them.
• Finally, even with all the aforementioned requisites fully taken care of, it
is always befitting to recall G.E.P. Box’s well-known adage: Essentially all
models are wrong, but some are useful. The practical question is how wrong
they have to be to not be useful. In a nutshell, as in everyday life, one should


Figure 3.1: Example 3.1. Left: The prior has a small variance and a bad guess, and it dominates the likelihood. Right: The prior involves the same bad guess but also a high uncertainty. It does not dominate.

tread the path of pragmatism, trying to reduce the number of wrongs. It


often suffices for good statistical modelling practice.
The new computer tools make it possible to handle almost all combinations
of likelihood functions and prior distributions for deriving a posterior distri-
bution by using
π(θ|x) ∝ π(θ)ℓ(θ|x). (3.1)
From this point of view we have great freedom in our choice, but of course it has to be reasonable. That the prior influences the posterior can be acceptable, but a prior which completely dominates the likelihood makes the statistical study useless. The most extreme example is a prior that is the one-point distribution at θ0. Then we know the data generating distribution and no experiment is needed. Let us discuss an example.

Example 3.1 (Dominating prior)


Let X = (X1 , . . . , Xn ) be an i.i.d. sample in Example 2.12, where the true
underlying data generating distribution is N(μ, σ 2 ). First we set as prior
N(μ0 , σa2 ), where μ0 is far away from μ and the variance σa2 is small. Roughly
speaking, we are sure that the parameter is close to μ0 . Alternatively we use a
normal prior N(μ0 , σb2 ) with the same expectation but with much larger vari-
ance σb2 , in order to express our uncertainty. In Figure 3.1 we see that the first
prior dominates the likelihood. The more carefully chosen second prior looks
reasonable. 2

There are two main approaches for the choice of a prior.


SUBJECTIVE PRIORS 31

Figure 3.2: Example 3.2. Young lions are more active.

(i) The prior is chosen in a pragmatic way to incorporate additional knowledge.


Sometimes it is called the subjective prior.
(ii) The prior is derived from a theoretical background. It should fulfill desired
properties, such as invariance, or it belongs to a preferred distribution fam-
ily. Here some times the name objective Bayesianism is used.
In the following we explain the main methods of both approaches.

3.1 Subjective Priors


First we consider the following toy example.

Example 3.2 (Lion’s appetite)


Recall Example 2.4. The appetite of a lion has three levels: hungry (θ1 ),
moderate (θ2 ), lethargic (θ3 ). An adult lion is only some times hungry, then
he eats so much that he is lethargic for the next time. The moderate stage is
unusual. Asking a zoologist, she proposed without doubt the following prior
distribution for an adult animal:
θ1 θ2 θ3
.
π(θ) 0.1 0.1 0.8

Young lions are more active. In this case her prior is:

θ1 θ2 θ3
.
π(θ) 0.3 0.1 0.6
2
32 CHOICE OF PRIOR

5
4
August 1989
3
2
1
0 August 1991

0.0 0.2 0.4 0.6 0.8 1.0

theta

Figure 3.3: Example 3.3. The subjective priors are determined by experiences.

Let us continue with a real study, where biological knowledge is explored. It


is a practical approach to explore the variability of a parametric distribution
family. The parameters of the prior are called hyperparameters. The idea is
to determine hyperparameters in such a way that the prior curve reflects
the subjective knowledge. We will demonstrate this method by the following
examples.

Example 3.3 (Capture probability)


Let us again consider the study in Example 2.19 published in Dupuis (1995).
The parameter θj is the probability that a marked lizard is caught at time j.
Let j = 2, which is August in the same year, and j = 6, the August two years
later. The probability to catch a lizard after two years again is smaller than in
the attempt. The expected values are 0.3 and 0.2 respectively, the variances
are around 0.01. In Dupuis (1995) beta distributions are applied and the
hyperparameters are chosen by taking into account biological rhythms of this
species; see Figure 3.3. 2

Example 3.4 (Determining hyperparameters)


Consider an i.i.d. sample from a normal distribution N(0, σ 2 ). The parameter
of interest θ is the variance σ 2 . We want to incorporate the additional infor-
mation that the variance is most probably around 2 and with relatively high
SUBJECTIVE PRIORS 33

0.6
1.0 ●

0.5
0.8

0.4
prior density
0.6

0.3
0.2
0.4

0.1
0.2
0.2

0.0
0 1 2 3 4 5
0 5 10 15 20 25 30
theta
alpha

Figure 3.4: Example 3.4. Left: Determining α by P (θ > 3) ≈ 0.8. The distribution
function of InvGamma(α, 2(α + 1)) at 3 and the line at level 0.8 are plotted. Right:
The prior density with α = 10 and β = 22 reflects the subjective knowledge.

probability less that 3. This can be expressed as

mode = 2 and P(σ 2 < 3) > 0.8.

As prior we assume an inverse-gamma distribution, InvGamma(α, β) with


density given in (7.26). The hyperparameters are α and β. The mode of
InvGamma(α, β) is
β
mode =
α+1
Thus β = 2(α + 1). The inverse-gamma distribution is included in R. By a
simple graphical method we determine α = 10 such that the second condition
is fulfilled; see following R Code and Figure 3.4. The proposed subjective prior
is InvGamma(10, 22). 2

R Code 3.1.1. Determining hyperparameters, Example 3.4.

library(invgamma) # package for inverse-gamma distribution


a<-seq(0.5,30,0.01) # hyperparameter alpha
b<-2*(a+1) # hyperparameter beta; the mode is defined as 2
plot(a,pinvgamma(3,a,b),"l",xlab="alpha",ylab="")
# distribution function at 3 as function of alpha
lines(0:30,rep(0.8,31),lty=2) # the desired level
points(10,pinvgamma(3,10,22)) # approximative crossing point
segments(10,0,10,pinvgamma(3,10,22),lty=2) # determining alpha
In the following example we are interested in the expected value of a nor-
mal distribution. We have only vague information and choose a heavy tailed
distribution as prior.
34 CHOICE OF PRIOR

0.08

0.06
subjective prior density

0.04
0.02

0.25
0.00

−10 −5 0 5 10

theta

Figure 3.5: Example 3.5. The subjective prior reflects the vague subjective informa-
tion.

Example 3.5 (Determining hyperparameters)


Consider an i.i.d. sample from a normal distribution N(μ, 1). The parameter
of interest θ is the expected value μ. We guess that the parameter μ is lying
symmetrically around 1 and the probability that the value is larger than 5 is
around 0.25, i.e.,

mode(π) = 1 and P(μ > 5) ≈ 0.25.

We assume a Cauchy distribution C(1, γ), which is heavy tailed and symmet-
rical around 1. The hyperparameter is the scaling parameter γ. The quantile
function of C(x0 , γ) is given by

q(p) = x0 + γ tan(π(p − 0.5)).

In this case x0 = 1, p = 0.75 and q(p) = 5. Because tan( π4 ) = 1 we get γ = 4.


The proposed subjective prior is C(1, 4); see also Figure 3.5. 2

Another way to implement subjective information into a prior distribution is


as following. Sometimes it is easier to have a good guess for the data point
than for the parameter distribution. This information can also be explored
for determining hyperparameters. For detailed discussion of this method, see
Berger (1980, Section 3.5). Here we only present an example.
SUBJECTIVE PRIORS 35

Example 3.6 (Determining hyperparameters via prediction)


We continue Example 2.11 and consider the Bayes model θ ∼ Beta(α0 , β0 ),
and X|θ ∼ Bin(n, θ). The marginal distribution of X can be calculated by
(2.6), as
 1 
n α0 +k−1
P(X = k) = θ (1 − θ)β0 +n−k−1 B(α0 , β0 )−1 dθ,
0 k

where B(a, b) is the beta function as the solution of the integral equation
 1
B(a, b) = xa−1 (1 − x)b−1 dx, a > 0, b > 0. (3.2)
0

We obtain the beta-binomial distribution BetaBin(n, α0 , β0 ),


 
n B(α0 + k, β0 + n − k)
P(X = k) = . (3.3)
k B(α0 , β0 )
These probabilities can be calculated, using the beta function implemented in
R, see Figure 3.6 and R Code 2. Let us now illustrate two different cases to
determine α0 and β0 . We begin by assuming equal probability,
1
P(X = k) = . (3.4)
n+1
Using  
n 1
= ,
k (n + 1)B(k + 1, n − k + 1)
we get
1 B(α0 + k, β0 + n − k)
P(X = k) = . (3.5)
n + 1 B(k + 1, n − k + 1)B(α0 , β0 )
For α0 = 1 and β0 = 1 we obtain (3.4). Recall that it is the Bayes model in
the historical Example 2.9.
Now we assume that the marginal distribution in (3.3) is symmetric and the
highest probability is around 0.3. The symmetry P(X = k) = P(X = n − k)
is fulfilled for α0 = β0 , and we plot P(X = n − k) as a function of α to ob-
tain α0 , see Figure 3.6 and following R Code. The assumed subjective prior
is Beta(3.8, 3.8). 2

R Code 3.1.2. Determining hyperparameters, Example 3.6.

n<-4; a=3.8; b=3.8; p<-rep(0,n+1)


for (i in 1:(n+1)){
p[i]<-beta(a+i-1,b+n-i+1)/beta(a,b)*choose(n,(i-1))
}
plot(0:n+1,p,"h",ylab="marginal probability",
xlab="x",lwd=7, col=grey(0.6))
36 CHOICE OF PRIOR
0.30

0.32
0.25


marginal probability

0.28
P(X=2)
0.20

0.24
0.15

0.20
1 2 3 4 5 2 4 6 8 10

x alpha

Figure 3.6: Example 3.4. Left: The marginal distribution for X is plotted. Right:
The parameter of the prior is determined such that the marginal distribution has
the desired properties.

aa=seq(1,10,0.01); a0=3.8
plot(aa,beta(aa+2,aa+2)/beta(aa,aa)*choose(n,n/2),"l")
lines(aa,rep(0.3,length(aa)),"l",lty=2)
points(a0,6*beta(a0+2,a0+2)/beta(a0,a0),lwd=3)
segments(a0,6*beta(a0+2,a0+2)/beta(a0,a0),a0,0,lty=2)
Another good proposal for a subjective prior is a mixture distribution. Mix-
ture distributions are rich parametric classes. A folk theorem says that every
distribution can be approximated by a mixture distribution, which is used in
several contexts without a clear statement. We quote here a recent result of
Nguyen et al. (2020).

Theorem 3.1 (Approximation by mixture) Assume g : Rp → R and


for all x ∈ Rp there exist two positive constants c1 , c2 such that

|g(x)| ≤ c1 (1 + x p2 )−p−c2 . (3.6)

Then for any continuous function f : Rp → R, there exists a sequence


hm : Rp → R of mixtures

m
1 x − μj m
hm (x) = cj p g( ), σj > 0, μj ∈ Rp , cj > 0, cj = 1
j=1
σj σj j=1

such that 
lim |f (x) − hm (x)|dx = 0.
m→∞

The following classroom example gives some insight in modelling by a mixture


distribution.
SUBJECTIVE PRIORS 37

0.10
0.08
subjective prior

0.06
0.04
0.02
0.00

−5 0 5 10 15 20

theta

Figure 3.7: Illustration for Example 3.7. The subjective prior is a mixture of normal
distributions.

Example 3.7 (Weather)


We are interested in the temperature at noon in February. The parameter
of interest is the expectation of the temperature θ. We ask two experts for
their knowledge. One expert guesses that the winter is mild with temperature
around 8 ◦ C. The other expert has a cold winter in mind. Believing in the
climate change we give the first expert the weight 0.8 but higher uncertainty.
The subjective prior of the first expert is proposed to be N(8, 10), the sub-
jective prior of the second expert comes out by N(−3, 4). Summarizing, we
assume a mixture prior
π(θ) = 0.8 φ(8,10) (θ) + 0.2 φ(−3,4) (θ), (3.7)
where φ(μ,σ2 ) denotes the density of N(μ, σ 2 ), see Figure 3.7. This mixture dis-
tribution fulfills the assumptions of the above theorem with g(.) = φ(0,1) (.). 2
In principle whenever it is possible to determine a weight function over the
parameter set and we can calculate the weight function at every argument, we
can apply a Bayesian analysis; see Figure 3.8. There are no closed expressions
for the prior or for the likelihood function needed. This approach is explained
in Chapter 9 on Bayesian computations.
38 CHOICE OF PRIOR

Figure 3.8: Subjective prior for the expected value of hours of sunshine.

In the next section we present the construction of a parametric distribution


family, which includes both the prior and the respective posterior. Such a
distribution family is called conjugate and it is highly recommended.

3.2 Conjugate Priors


This approach was introduced in Raiffa and Schlaifer (1961). We start with
the definition of a conjugate family.

Definition 3.1 (Conjugate family) For a given statistical model P =


{Pθ : θ ∈ Θ} a family F of probability distributions on Θ is called conjugate
iff for every prior π(.) ∈ F and for every x the posterior π(.|x) ∈ F.

The family F is also called closed under sampling. The prior distribution
which is an element of a conjugate family is named conjugate prior.

This family is very practical, since only an updating of the hyperparameters is


needed to compute the posterior. We illustrate it with the help of the flowers
example (Example 2.1).
CONJUGATE PRIORS 39
Example 3.8 (Flowers)
The data (n1 , n2 , n3 ) are the number of flowers of respective colours. The
sample size n = n1 + n2 + n3 is fixed. The data generating distribution a
multinomial, Mult(n, p1 , p2 , p3 ):
n!
P(n1 , n2 , n3 ) = p n 1 p n 2 pn 3 . (3.8)
n1 ! n2 ! n3 ! 1 2 3
The unknown parameter θ = (p1 , p2 , p3 ) consists of the probabilities for each
colour, so that p1 + p2 + p3 = 1. We consider as prior for θ a Dirichlet distri-
bution Dir3 (e1 , e2 , e3 ) with

π(θ1 , θ2 , θ3 |e1 , e2 , e3 ) ∝ θ1e1 −1 θ2e2 −1 θ3e3 −1 . (3.9)

The posterior is calculated by

π(θ|(n1 , n2 , n3 )) ∝ π(θ1 , θ2 , θ3 |e1 , e2 , e3 )P(n1 , n2 , n3 ).

We obtain
π(θ|(n1 , n2 , n3 )) ∝ θ1e1 +n1 −1 θ2e2 +n2 −1 θ3e3 +n3 −1 ,
which is the kernel of a Dirichlet distribution with parameters:

e 1 + n1 , e 2 + n2 , e 3 + n3 . (3.10)

Thus the set of Dirichlet distributions

F = {Dir3 (α1 , α2 , α3 ) : α1 > 0, α2 > 0, α3 > 0}

forms a conjugate family. The prior in (3.9) is a conjugate prior. To com-


pute the posterior, we apply the updating rule (3.10). The hyperparameters
e1 , e2 , e3 in (3.9) can be determined by subjective information. We know that
violet is the common colour, white occurs also, but pink is an exception. Set
e0 = e1 + e2 + e3 and using the subjective expected values of the components,
e1 e2 e3
Eθ1 = = 0.2, Eθ2 = = 0.7, Eθ3 = = 0.1,
e0 e0 e0
we choose e1 = 2, e2 = 7, e3 = 1. In this illustrative example we assume a
subjective prior which is also conjugate, i.e.,

θ ∼ Dir3 (2, 7, 1).

In Example 2.11 the statistical model is a binomial distribution with unknown


success probability, the prior and the posterior distribution are beta distribu-
tions. Example 2.12 is also a Bayes model with a conjugate prior, where the
i.i.d. sample is from a normal distribution, the parameter of interest is the
expected value and the conjugate prior is a normal distribution.
40 CHOICE OF PRIOR
In Example 2.13 the i.i.d. sample is from a normal distribution, the parameter
of interest is the inverse variance, the conjugate prior is a gamma distribution.
The same set up was also considered in Example 2.14. Here the parameter of
interest is the variance and the conjugate prior is an inverse-gamma distribu-
tion.

In Example 2.16, however, the prior is not conjugate.

The common property of the data generating distributions in Examples 2.11,


3.8 and 2.12 is that the related statistical model belongs to an exponential
family. In the following we state a general result for the construction a con-
jugate family for data generating distributions belonging to an exponential
family.

First we define the exponential family.

Definition 3.2 (Exponential family of dimension k) A statistical


model P = {Pθ : θ ∈ Θ} belongs to an exponential family of dimension
k, if there exist real-valued functions ζ1 , . . . , ζk on Θ, real-valued statistics
T1 , . . . , Tk and a function h on X such that the probability function has
the form

k
p(x|θ) = C(θ) exp( ζj (θ)Tj (x))h(x). (3.11)
j=1

The representation (3.11) is not unique. The exponential family is called


strictly k-dimensional, iff it is impossible to reduce the number of terms in
the sum in (3.11).

Many distributions belong to exponential family. The exponential families


have several interesting analytical properties. Especially in statistical infer-
ence, estimation and testing theory has been extensively developed for expo-
nential families; see for instance Liero and Zwanzig (2011). For more theoret-
ical results on exponential families we recommend Liese and Miescke (2008),
see also Robert (2001) and the literature therein.

The main trick is to reformulate the presentation of the probability function


such that it has the form (3.11). Hereby, of most interest are the sufficient
statistic T (x) = (T1 , . . . , Tk ) and the parameter ζ(θ) = (ζ1 , . . . , ζk ). C(θ) is
the normalizing constant. The new parameter ζ = ζ(θ) = (ζ1 , . . . , ζk ) is called
natural parameter.

We demonstrate it by three examples. Consider the distributions in Examples


3.8 and 2.12.
CONJUGATE PRIORS 41
Example 3.9 (Multinomial distribution)
The multinomial distribution Mult(n, p1 , . . . , pk ) is a discrete distribution
with probability function
n!
P(n1 , . . . , nk ) = pn1 . . . pnk k , (3.12)
n1 ! . . . nk ! 1
where

k 
k
ni = n, pi = 1. (3.13)
i=1 i=1

Using the relation A,

z a = exp(ln(z a )) = exp(a ln(z)) (3.14)

we obtain

k
P(n1 , . . . , nk ) = h(n1 , . . . , nk ) exp( ln(pi )ni ),
i=1

with
n!
h(n1 , . . . , nk ) = .
n1 ! . . . nk !
Hence the multinomial distribution belongs an exponential family. By using
(3.13) we can reduce the dimension. Since,


k 
k−1
pi  k−1
ln(pi )ni = ni ln( ) + n ln(pk ); pk = 1 − pi .
i=1 i=1
pk i=1

the multinomial distribution belongs to a (k − 1)-dimensional exponential


family with
p1 pk−1
T = (n1 , . . . , nk−1 ) and ζ = (ln( ), . . . , ln( )).
pk pk
2

Example 3.10 (Dirichlet distribution)


k
For x = (x1 , . . . , xk ) with i=1 xi = 1, the Dirichlet distribution
Dirk (α1 , . . . , αk ) has the density
Γ(α0 ) k −1
f (x|α1 , . . . , αk ) = xα1 −1 . . . xα , (3.15)
Γ(α1 ) . . . Γ(αk ) 1 k

k
where α0 = i=1 αi . Using the relation (3.14) we rewrite

Γ(α0 ) k
f (x|α1 , . . . , αk ) = exp( ln(xi )(αi − 1)).
Γ(α1 ) . . . Γ(αk ) i=1
42 CHOICE OF PRIOR
Set h(x) = x−1
1 . . . x−1
k . We get

Γ(α0 ) k
f (x|α1 , . . . , αk ) = h(x) exp( ln(xi )αi ).
Γ(α1 ) . . . Γ(αk ) i=1

The family of Dirichlet distributions forms a k-dimensional exponential family


with
T (x) = (ln(x1 ), . . . , ln(xk ))
and
ζ = (α1 , . . . , αk ).
2

Example 3.11 (Normal distribution)


The normal distribution N(μ, σ 2 ) has the density
 
1 (x − μ)2
f (x|μ, σ ) = √
2
exp − ,
2πσ 2 2σ 2

which can be decomposed into


   
1 μ2 x2 xμ
f (x|θ) = √ exp − 2 exp − 2 + 2 .
2πσ 2 2σ 2σ σ

Thus the normal distribution is a two-parameter exponential family with θ =


(μ, σ 2 ) where
1 μ
T (x) = (x2 , x) and ζ(θ) = (− 2 , 2 ).
2σ σ
2

Theorem 4.12 gives a very useful result on exponential families, namely that
for an i.i.d. sample to follow an exponential family, it suffices that Xi follows
an exponential family; see Liero and Zwanzig (2011).

Theorem 3.2
If X = (X1 , . . . , Xn ) is an i.i.d. sample from a distribution of the form
(3.11) with functions ζj and Tj , j = 1, . . . , k then the distribution of X
belongs to an exponential family with parameter ζj , and statistics


n
T(n,j) (x) = Tj (xi ), j = 1, . . . , k.
i=1
CONJUGATE PRIORS 43
Proof: Since the distribution of Xi belongs to an exponential family the
probability function of the sample is given by

n 
k
p(x|θ) = C(θ) exp( ζj (θ)Tj (xi ))h(xi )
i=1 j=1


k 
n
= C(θ)n exp( ζj (θ) Tj (xi ))h̃(x) (3.16)
j=1 i=1
n
with h̃(x) = i=1 h(xi ). Thus the distribution of X = (X1 , . . . , Xn ) belongs
nan exponential family with the functions ζj and the statistics T(n,j) (x) =
to
i=1 Tj (xi ).
2
We apply this result to a normal i.i.d. sample.

Example 3.12 (Normal distribution)


Consider an i.i.d. sample X1 , . . . , Xn from N(μ, σ 2 ). Then the distribution
of X = (X1 , . . . , Xn ) belongs to a two-parameter exponential family with
θ = (μ, σ 2 ) and

n 
n
Tn (x) = ( xi , x2i )
i=1 i=1
and
μ 1
ζ(θ) = ( 2
, − 2 ).
σ 2σ
2
The following theorem tells how we can get the conjugate family for the natural
parameter of a statistical model belonging to a k-dimensional exponential
family. For convenience we write the probability function as
p(x|θ) = h(x) exp(θT T (x) − Ψ(θ)), (3.17)
where x ∈ R and θ ∈ Θ ⊆ R ; the function Ψ(θ)) = − ln(C(θ)) is called
n k

cumulant generating function; see Robert (2001).

Theorem 3.3
Assume X has a distribution of the form (3.17). Then the conjugate family
over Θ is given by

F = {π(θ|μ, λ) = K(μ, λ) exp(θT μ − λΨ(θ)) : μ ∈ Rk , λ ∈ R, λ > 0}.


(3.18)
The posterior belonging to the prior π(θ|μ0 , λ0 ) has the parameters

μ = μ0 + T (x) and λ = λ0 + 1.
44 CHOICE OF PRIOR
Proof: Suppose π(θ) ∈ F such that

π(θ) = K(μ0 , λ0 ) exp(θT μ0 − λ0 Ψ(θ)).

The posterior is determined by π(θ)l(θ|x). Thus

π(θ|x) ∝ K(μ0 , λ0 ) exp(θT μ0 − λ0 Ψ(θ))h(x) exp(θT T (x) − Ψ(θ))


∝ exp(θT μ0 − λ0 Ψ(θ)) exp(θT T (x) − Ψ(θ))
∝ exp(θT (μ0 + T (x)) − (λ0 + 1)Ψ(θ)).

Hence the posterior belongs to F with parameters μ0 + T (x) and λ0 + 1.


2
We demonstrate the application of Theorem 3.3 by the following example.

Example 3.13 (Gamma distribution and conjugate prior)


Consider an observation X ∼ Gamma(α, β) with known shape parameter
α > 0. Parameter of interest θ is the rate parameter β > 0. The density of
Gamma(α, θ) is
θα α−1
f (x|α, θ) = x exp(−θx). (3.19)
Γ(α)
Hence the statistical model belongs to a 1-dimensional exponential family of
form (3.17) with natural parameter θ and

xα−1
T (x) = −x, Ψ(θ) = −α ln(θ), h(x) = .
Γ(α)

Applying Theorem 3.3, a conjugate prior is

π(θ) ∝ exp(θμ0 − λ0 Ψ(θ))

and the posterior

π(θ|x) ∝ exp(θ(μ0 − x) − (λ0 + 1)Ψ(θ)).

For μ0 < 0 we can rewrite the kernels as those of gamma distributions. We


have
π(θ) ∝ θαλ0 exp(θμ0 ),
π(θ|x) ∝ θα(λ0 +1) exp(θ(μ0 − x))
which are the kernels of Gamma(λ0 α−1, −μ0 ) and Gamma(α(λ0 +1)−1, −μ0 +
x) respectively. Summarizing we notice that if θ ∼ Gamma(a0 , b0 ) with a0 > 0
and b0 > 0 then θ|x ∼ Gamma(a0 + α, b0 + x). 2

Generalized linear models also belong to exponential families. Let us get back
to the Corona Example 2.5.
CONJUGATE PRIORS 45
Example 3.14 (Logistic regression)
We consider a general logistic regression model. The data set is given by
(yi , x1i , . . . , xpi ) with i = 1, . . . , n, where y1 , . . . , yn are independent re-
sponse variables. The success probability depends on the covariates xi =
(x1i , . . . , xpi )T , i.e.,
P(yi = 1|xi , θ) = p(xi ).
The statistical model is

n
p(y1 , . . . , yn |xi , θ) = p(xi )yi (1 − p(xi ))(1−yi )
i=1

n 
n
= exp ln(p(xi ))yi + ln(1 − p(xi ))(1 − yi )
i=1 i=1
n   
n
p(xi )
= exp ln yi + ln(1 − p(xi )) .
i=1
1 − p(xi ) i=1
(3.20)
The main trick is to find a canonical link function g(.), such that
 n   n
p(xi )
ln yi = yi xTi θ.
i=1
1 − p(x i ) i=1

In this case g(.) is the logistic function


 
exp(z) g(z)
g(z) = , ln = z.
1 + exp(z) 1 − g(z)
Assuming
p(xi ) = g(xTi θ), (3.21)
we get

n
p(y1 , . . . , yn |x1 , . . . , xp , θ) = exp yi xTi θ − Ψ(θ|x1 , . . . , xp ) .
i=1

Hence under Assumption (3.21), the statistical model is a p-dimensional ex-


ponential family with natural parameter θ and

n 
n 
n
T (y) = yi xi , Ψ(θ|x1 , . . . , xp ) = − ln(1−p(xi )) = ln(1+exp(xT
i θ)).
i=1 i=1 i=1

Applying Theorem 3.3, a conjugate prior is


π(θ|x1 , . . . , xp ) ∝ exp(θT μ0 − λ0 Ψ(θ|x1 , . . . , xp )).
2
46 CHOICE OF PRIOR
The following table collects the conjugate priors and corresponding posteriors
related to some popular one parameter exponential families with parameter
θ, assuming all other parameters known, where the posteriors are computed
for a single observation x. The table is partly taken from Robert (2001).
Distribution Prior Posterior
p(x|θ) π(θ) π(θ|x)
Normal Normal Normal
 
N(θ, σ 2 ) N(μ, τ 2 ) N ρ(σ 2 μ + τ 2 x), ρσ 2 τ 2
ρ−1 = σ 2 + τ 2
Poisson Gamma Gamma
Poi(θ) Gamma(α, β) Gamma(α + x, β + 1)
Gamma Gamma Gamma
Gamma(ν, θ) Gamma(α, β) Gamma(α + ν, β + x)
Binomial Beta Beta
Bin(n, θ) Beta(α, β) Beta(α + x, β + n − x)
Negative Binomial Beta Beta
NB(m, θ) Beta(α, β) Beta(α + m, β + x)
Multinomial Dirichlet Dirichlet
Multk (θ1 , . . . , θk ) Dir(α1 , . . . , αk ) Dir(α1 + x1 , . . . , αk + xk )
Normal Gamma Gamma
  (μ−x)2
N μ, θ1 Gamma(α, β) Gamma(α + 12 , β + 2 )
Normal InvGamma InvGamma
(μ−x)2
N (μ, θ) InvGamma(α, β) InvGamma(α + 12 , β + 2 )

In case we want to include the prior knowledge from different sources we can
also construct an averaging prior which is conjugate. The following theorem
is an extension of Theorem 3.3 to mixtures.
CONJUGATE PRIORS 47

Theorem 3.4 (Mixture of exponential distributions)


Assume X has a distribution of the form

p(x|θ) = exp(θT T (x) − Ψ(θ))h(x).

Then the set of mixture distributions on Θ,

N 
N
FN = { ωi π(θ|μi , λi ) : ωi = 1, ωi > 0,
i=1 i=1
π(θ|μi , λi ) = K(μi , λi ) exp(θT μi − λi Ψ(θ)),
μi ∈ Rk , λi ∈ R}
(3.22)

is a conjugate family. For the prior


N
π(θ) = ωi π(θ|μi , λi ),
i=1

the posterior is


N
π(θ|x) = ωi (x)π(θ|μi + T (x), λi + 1)
i=1

with
ωi K(μi , λi )
ωi (x) ∝ .
K(μi + T (x), λi + 1)

Proof: Applying Theorem 3.3 we get

π(θ|x) ∝ π(θ)(θ|x)

N
∝ ωi K(μi , λi ) exp(θT μi − λi Ψ(θ)) exp(θT T (x) − Ψ(θ))
i=1

N
∝ ωi K(μi , λi ) exp(θT (μi + T (x)) − (λi + 1)Ψ(θ))
i=1

N
∝ ωi (x)K(μi + T (x), λi + 1) exp(θT (μi + T (x)) − (λi + 1)Ψ(θ)).
i=1
(3.23)
2
We now turn back to the classroom Example 3.7 and Figure 3.7.
48 CHOICE OF PRIOR
Example 3.15 (Weather)
We assume that the temperature measurements are normally distributed.
Thus  
1 1 x2 1 1 2
f (x|θ) = √ exp(− 2 ) exp θx − θ .
2π σ 2σ σ2 2σ 2
We have
1 2 1
Ψ(θ) =θ , T (x) = 2 x.
2σ 2 σ
From Theorem 3.3 a conjugate family is
1 2
F = {K(μ, λ) exp(θμ − λ θ ) : μ ∈ R, λ ∈ R, λ > 0}
2σ 2
with √  
1 λ σ2 2
K(μ, λ) = √ exp − μ .
2π σ 2λ
2 2
Parameterizing by τ 2 = σλ and m = σλ μ, it is the family of normal distribu-
tions
F = {N(m, τ 2 ) : m ∈ R, τ 2 > 0}.
Recall Example 2.12 with n = 1, the posterior belonging to the prior N(m, τ 2 )
is N(m(x), τp2 ) where

m(x) = ρ(xτ 2 + mσ 2 ), τp2 = ρστ 2 , ρ = (τ 2 + σ 2 )−1 .

The prior in Example 3.7 fulfills the conditions of Theorem 3.4. It is a mixture
of normal distributions

π(θ) = ω1 φ(m1 ,τ12 ) (θ) + ω2 φ(m2 ,τ22 ) (θ).

Thus the posterior is a mixture of the related posteriors

π(θ|x) = ω1 (x) φ(m1 (x),τ1,p


2 ) (θ) + ω2 (x) φ(m (x),τ 2 ) (θ)
2 2,p
(3.24)

with

mi (x) = ρi (xτi2 + mi σ 2 ), τi,p


2
= ρi στi2 , ρi = (τi2 + σ 2 )−1 , i = 1, 2,

and
τi 1 1
ωi (x) ∝ ωi exp − 2 m2i + 2 mi (x)2
τi,p 2τi 2τi,p

with ω1 (x) + ω2 (x) = 1. Figures 3.9 and 3.10 show the prior and posterior. 2
CONJUGATE PRIORS 49

First Expert Second Expert


0.30

0.30
0.25

0.25
0.20

0.20
0.15

0.15
0.10

0.10
0.05

0.05
0.00

0.00

−5 0 5 10 15 20
−5 0 5 10 15 20
theta
theta

Figure 3.9: Illustration for Example 3.15. Left: The subjective prior (broken line) of
expert 1 and the posterior after observing x = 4. Right: The prior (broken line) and
posterior related to expert 2.
0.20
0.15
0.10
0.05
0.00

−5 0 5 10 15 20

theta

Figure 3.10: Illustration for Example 3.15. The broken line is the prior mixture
distribution. The continuous line is the posterior, which is also a mixture, but the
posterior weight of the second expert is shifted from 0.2 to 0.035, because the ob-
servation x = 4 lies in the tail of the second expert’s prior.
50 CHOICE OF PRIOR
3.3 Non-informative Priors
Basically, the non–informative priors are recommended when no prior knowl-
edge is available but we still want to explore the advantages of Bayesian mod-
elling. However they can be used for other reasons, e.g., when we do not trust
the subjective knowledge and want to avoid any conflict with the likelihood
approach. But it is unclear which prior deserves the name non–informative.
There are different approaches but it is still an open problem and new set ups
are under development. We present here main principles.

3.3.1 Laplace Prior


The name of Laplace is connected with first papers on probability. He defined
the probability of an event A ⊂ Ω as the ratio
number of cases ω belonging to A
P(A) = .
number of all ω in Ω
Behind this definition is the imagination that all elements of Ω have the same
chance to be drawn.
1
P({ω}) = = const.
number of all ω in Ω
It works in case when the number of all ω in Ω is finite. This approach is
generalized as constant probability function

p(ω) ∝ const.

Note that for unbounded Ω, a constant measure is no longer a probability


measure because it cannot be normalized since
 ∞
const = ∞.
−∞

Applying this to the Bayesian context we have the following definition.

Definition 3.3 (Laplace prior) Consider a statistical model P = {Pθ :


θ ∈ Θ}. The Laplace prior is defined as constant

π(θ) ∝ const.

Note that, the notational system in statistics is not unique, so that sometimes
in literature prior distributions following a Laplace distribution are named
Laplace prior, which are of course not constant.
Definition 3.3 is also applied to unbounded Θ. In general, priors which are not
probability measures are called improper priors.
NON-INFORMATIVE PRIORS 51
Suppose a finite set Θ = {θ1 , . . . , θm }. The Shannon entropy is defined by

m
H(π) = − π(θi ) log(π(θi )). (3.25)
i=1

It describes how much the probability mass of π is spread out on Θ. The higher
the entropy the less informative is the parameter. It holds for all j = 1, . . . , m
that

d 
m
1 .
− ki log(ki ) = −kj − log(kj ) = −1 − log(kj ) = 0,
dkj i=1
kj

i.e., the measure with maximal entropy has constant weight on all elements of
Θ. Thus the Laplace prior fulfills the idea of no information. For illustration,
see Figure 3.11.

Example 3.16 (Lion’s appetite)


Consider Example 2.4. The lion has only three different stages: hungry, moder-
ate, lethargic. The Laplace non-informative prior gives every stage the weight
1
3. 2

Consider the Bayes billiard table again.

Example 3.17 (Binomial distribution)


In Example 3.6 we considered a binomial model X|θ ∼ Bin(n, θ) with
x ∈ X = {0, 1, . . . , n} and a beta distributed prior for θ ∈ [0, 1]. In Ex-
1
ample 3.6 we assumed for the marginal distribution of X that P({x}) = n+1 .
The related beta prior is the uniform distribution U [0, 1] which is the Laplace
non-informative prior. 2

Example 3.18 (Normal sample and Laplace prior)


In Example 2.12 we had X = (X1 , . . . , Xn ) i.i.d from N(θ, σ 2 ) with known
variance σ 2 . The parameter space Θ = R is unbounded and the Laplace prior
π(θ) ∝ const is improper. But the posterior
n
π(θ|x) ∝ (μ|x) ∝ exp − (μ − x̄)2
2σ 2
is a normal distribution N(x̄, n1 σ 2 ) which is proper. Furthermore the Laplace
non-informative prior is a limiting case of a normal prior with increasing prior
variance. The prior variance can be interpreted as a measure of uncertainty,
where high variance indicates our higher uncertainty, see Figure 3.11. 2
52 CHOICE OF PRIOR

0.4
0.4

Laplace, E=2.197 N(1,1)


N(1,4)
Bin(8,0.7), E=1.663
N(1,16)

0.3
N(1,64)
0.3

0.2
0.2

0.1
0.1
0.0

0.0
0 2 4 6 8 −4 −2 0 2 4 6

theta theta

Figure 3.11: Laplace non-informative prior. Left: Illustration of spread and entropy.
E stands for the entropy measure (3.25). Right: In Example 3.18 the Laplace prior
can be considered as limit of normal priors under increasing variance.

Furthermore the Bayes analysis with Laplace non-informative priors conforms


to the likelihood approach as long as we do not change the parameter. But
Laplace non-informative priors have a big disadvantage: Laplace priors de-
pend on parametrization.
Assume Θ ⊂ Rp . Consider the transformation of the prior for re-
parametrization. Suppose a bijective function g : θ → η = g(θ) ∈ G ⊂ Rp
and h : η ∈ G ⊂ Rp → θ = h(η) ∈ H ⊂ Rp where ηj = gj (θ) and θi = hi (η),
i, j = 1, . . . , p. Then the Jacobian matrix is defined as
   
∂h ∂h ∂hi (η)
J= ,..., = (3.26)
∂η1 ∂ηp ∂ηj i=1,...,p,j=1,...,p.

Applying the rules of variable transformation for distributions, we have

πη (η) = πθ (h(η))|J|, (3.27)

where |J| = det(J) is the Jacobian determinant. A constant prior with respect
to θ does not deliver a constant prior with respect to η, so that the Laplace
non-informative prior does not fulfill (3.27). We illustrate it in the following
example.

Example 3.19 (Binomial distribution and odds ratio)


Consider Example 2.9, where X|θ ∼ Bin(n, θ) and θ ∼ U[0, 1]. The parameter θ
is the success probability of a binomial distribution. An alternative parameter
η is the odds ratio, the ratio between the probabilities of success and failure.
θ η
η= , θ= .
1−θ 1+η
NON-INFORMATIVE PRIORS 53
The Jacobian is  
d η 1
J= = .
dη 1+η (1 + η)2
Applying (3.27) we obtain, under θ ∼ U[0, 1], the prior over [0, ∞), as
1
π(η) = .
(1 + η)2
Note that, this distribution does not belong to a conjugate family. 2

3.3.2 Jeffreys Prior


In Jeffreys (1946) Harold Jeffreys proposed non–informative priors which are
invariant to re–parametrization. His main argument was that if we are, first
and foremost, interested in information on the data generating distribution
Pθ , and less in the parameter θ itself, then the distance between distributions
matters.

For two probability measures P, Q, the Kullback–Leibler divergence K(P|Q)


measures the information gained if P is used instead of Q. Let P, Q be con-
tinuous distributions with densities p, q, respectively, then

p(x)
K(P|Q) = p(x) ln( )dx. (3.28)
q(x)
The Kullback–Leibler divergence is not symmetric. For a symmetric metric
we use

p(x)
I2 (P, Q) = K(P|Q) + K(Q|P) = (p(x) − q(x)) ln( )dx.
q(x)
Jeffreys argument goes as following. For P = Pθ and Q = Pθ we obtain

p(x|θ)
I2 (Pθ , Pθ ) = (p(x|θ) − p(x|θ )) ln( )dx.
p(x|θ )
Let p (x|θ) denote the p-dimensional vector of first partial derivatives with
respect to θ. Then, using

p(x|θ) − p(x|θ ) ≈ p (x|θ)T (θ − θ ),

and
p(x|θ) 1
ln( ) = ln(p(x|θ)) − ln(p(x|θ ) ≈ p (x|θ)T (θ − θ ),
p(x|θ ) p(x|θ)
we obtain

 T 1
I2 (Pθ , Pθ ) ≈ (θ − θ ) p (x|θ)p (x|θ)T dx (θ − θ ).
p(x|θ)
54 CHOICE OF PRIOR
From the definition of the score function
∂ ln p(x|θ) 1
V (θ|x) = = p (x|θ)
∂θ p(x|θ)
and the Fisher information matrix is (see Liero and Zwanzig, 2011)
I(θ) = Covθ V (θ|x). (3.29)
We have
 
1 1
p (x|θ)p (x|θ)T dx = p (x|θ)p (x|θ)T p(x|θ)dx
p(x|θ) p(x|θ)2
= Eθ (V (θ|x)V (θ|x)T )
= Covθ (V (θ|x))
= I(θ)
so that
I2 (Pθ , Pθ ) ≈ (θ − θ )T I(θ)(θ − θ ). (3.30)
Thus the distance between the probability measures generates a weighted
distance in the parameter space. Changing the parametrization and using
θ − θ ≈ J(η − η  )
where J is the Jacobian matrix defined in (3.26), we have
I2 (Pθ , Pθ ) ≈ (θ − θ )T I(θ)(θ − θ )
≈ (η − η  )T JT I(θ)J(η − η  ).

Further, by the chain rule, the score function V (η|x)) with respect to η is

V (η|x)) = V (θ|x))T J
which gives
I(η) = JT I(θ)J (3.31)
and we see that different parametrizations for the same probability distribu-
tions deliver the same distance
I2 (Pθ , Pθ ) = I2 (Pη , Pη ).
The relation (3.31) is Jeffreys main argument for an invariant prior.

Definition 3.4 (Jeffreys prior) Consider a statistical model P = {Pθ :


θ ∈ Θ} with Fisher information matrix I(θ).
The Jeffreys prior is defined as
1
π(θ) ∝ det(I(θ)) 2 .
NON-INFORMATIVE PRIORS 55
Being invariant, Jeffreys prior fulfills the relation (3.27). Using (3.31) and
(3.27) we have
1
π(η) ∝ det(I(η)) 2
1
∝ det(JT I(θ)J) 2
1
∝ det(I(θ)) 2 det(J)
∝ π(θ) det(J)

Although, Jeffreys first goal was to find an invariant prior, his prior is also
non–informative in the sense that the prior has no influence because the pos-
terior distribution based on Jeffreys prior coincides approximately with the
likelihood function. Let us explain it in more detail.
Assume that we have an i.i.d. sample x = (x1 , . . . , xn ) from a regular statis-
tical model, which means the Fisher information matrix exists and it holds
that (see Liero and Zwanzig, 2011, Theorem 3.6)

I(θ) = Covθ V (θ|x) and In (θ) = Covθ V (θ|x) = −Eθ J(θ|x) = nI(θ),
(3.32)
where V (θ|x) is the vector of first derivatives of the log likelihood function
and J(θ|x) is the matrix of second derivatives of the log likelihood function.
Let θ be the maximum likelihood estimator with V (θ|x) = 0. Applying the
Taylor expansion about θ we have
1
ln p(x|θ) ≈ ln p(x|θ) + (θ − θ)T V (θ|x) + (θ − θ)T J(θ|x)(θ − θ)
2
1
≈ ln p(x|θ) + (θ − θ) J(θ|x)(θ − θ).
T
2

Approximating J(θ|x) by −nI(θ) we get


n
ln p(x|θ) ≈ ln p(x|θ) − (θ − θ)T I(θ)(θ − θ). (3.33)
2
Thus we obtain an approximation of the likelihood function (θ|x) = p(x|θ)
by
n
(θ|x) ≈ (θ|x) exp − (θ − θ)T I(θ)(θ − θ)
2 (3.34)
n
∝ exp − (θ − θ)T I(θ)(θ − θ) .
2
Calculating the posterior with the Jeffreys prior and (3.34) we get

π(θ|x) ∝ π(θ)(θ|x)
1
∝ det(I(θ)) 2 (θ|x)
1 n
∝ det(I(θ)) 2 exp − (θ − θ)T I(θ)(θ − θ) ,
2
56 CHOICE OF PRIOR
1 −1
which is the kernel of N(θ, n I(θ) ).
Thus Jeffreys prior is precisely the right
weight needed to obtain the same result as in asymptotic inference theory,
where it is shown that (see e.g., van der Vaart, 1998)
√ D
n(θ − θ) −→ N(0, I(θ)−1 ).

The first example we want to calculate Jeffreys prior for, and to compare it
with other approaches, is the binomial model.

Example 3.20 (Binomial distribution)


For X|θ ∼ Bin(n, θ), we have
 
n
ln(p(x|θ)) = ln( ) + x ln(θ) + (n − x) ln(1 − θ)
x

with
x n−x
V (θ|x) = −
θ 1−θ
x n−x
J(θ|x) = − 2 − .
θ (1 − θ)2
Using EX = nθ, we get the Fisher information
nθ n − nθ n
I(θ) = −EJ(θ|x) = + = ,
θ 2 (1 − θ) 2 θ(1 − θ)

and finally the Jeffreys prior

π(θ) ∝ θ− 2 (1 − θ)− 2 .
1 1

This is the kernel of Beta( 12 , 12 ), which is symmetric and gives more weight
to small and large parameters, but less weight to parameters around 12 ; see
Figure 3.12. 2

Example 3.21 (Location model)


We assume X ∈ Rp and that the data generating distribution Pθ belongs to a
location family. In the continuous case the density has the structure

p(x|θ) = f (x − θ), (3.35)



with known density function f (x) ≥ 0, X f (x)dx = 1 and finite positive
definite Fisher information matrix,

1 
I(f ) = f (x)f  (x)T dx,
f (x)
NON-INFORMATIVE PRIORS 57

3.0
Laplace
Jeffreys

2.5
subjective

2.0
prior

1.5
1.0
0.5
0.0

0.0 0.2 0.4 0.6 0.8 1.0

theta

Figure 3.12: Illustration for Examples 3.6, 3.17 and 3.20. Jeffreys prior is Beta( 12 , 12 ),
Laplace non-informative prior is Beta(1, 1) and the subjective prior is Beta(3.8, 3.8),
as determined in Example 3.6.

where f  (x) is the p-dimensional vector of first derivatives. The unknown


parameter θ ∈ Rp is the location parameter and has the same dimension as
X . The Fisher information matrix is calculated by

1
I(θ) = f  (x − θ)f  (x − θ)T dx.
f (x − θ)

Changing the variable of integration by z = x − θ with dz = dx we have that


the Fisher information matrix is independent on θ,

I(θ) = I(f ).

In the location model Jeffreys prior coincides with Laplace non-informative


prior,
π(θ) ∝ const.
For illustration, see Figure 3.13, which is related to the Laplace distribution
La(θ, 1) . Recall that in general the density of a Laplace distribution La(μ, σ)
is given by
1 1
f (x|μ, σ) = exp(− |x − μ|). (3.36)
2σ σ
2
58 CHOICE OF PRIOR

0.5
0.5

0.4
0.4

0.3
prior
0.3

0.2
0.2

0.1
0.1

0.0
0.0

−5 0 5

−5 0 5 location parameter

Figure 3.13: Example 3.21. Jeffreys prior for location parameter. Left: Densities
of the Laplace distribution La(θ, 1) with location parameters −4, −2, 0, 2, 4. Right:
Jeffreys prior.

We now consider a scale family.

Example 3.22 (Scale model)


We assume X ∈ R and that the data generating distribution Pθ belongs to a
scale family, where the unknown parameter θ = σ ∈ R is the positive scale
parameter. In the continuous case, the density has the structure
1 x
p(x|θ) = f ( ), (3.37)
σ σ

where f (x) ≥ 0, f (x)dx = 1, is a known density with derivative f  (x) =
d
dx f (x). Under regularity conditions we have
 
dp(x|θ) d d
dx = p(x|θ)dx = 1 = 0,
dθ dθ dθ
where
dp(x|θ) 1 x  x x
=− 2 f ( ) + f( ) .
dθ σ σ σ σ
Changing the variable of integration z = σx with dz = dx
σ , we have
 
1 
zf (z)dz + 1 = 0,
σ

so that xf  (x)dx = −1.
The Fisher information matrix is calculated by

1 1 x  x x 2
I(θ) = 4 1 x f ( ) + f ( ) dx.
σ σ f(σ )
σ σ σ
NON-INFORMATIVE PRIORS 59
1.0

10
sigma=0.5
sigma=1
0.8

sigma=2

8
0.6

6
prior
0.4

4
2
0.2
0.0

0.2 0.4 0.6 0.8 1.0 1.2 1.4

−6 −4 −2 0 2 4 6 scale parameter

Figure 3.14: Example 3.22. Jeffreys prior for scale parameter. Left: Densities of the
Laplace distribution La(0, σ) with different scale parameters. Right: Jeffreys prior.

By the same change of variable, we have



1 1
(zf  (z) + f (z)) dz
2
I(θ) = 2
σ f (z)
   
1 1 2  2 
= 2 z f (z) dz + 2 zf (z)dz + f (z)dz .
σ f (z)
 
Applying xf  (x)dx = −1 and f (x)dx = 1, we obtain
 
1 1 2  2
I(θ) = 2 z f (z) dz − 1 .
σ f (z)

Hence Jeffreys prior for the scale parameter θ = σ is


1
π(σ) ∝ . (3.38)
σ
Jeffreys prior does not
 ∞coincide with the Laplace non-informative prior, but
it is improper, since 0 x1 dx = ∞. Figure 3.14 illustrates it for density f of
the Laplace distribution La(0, 1). 2

The situation changes when we study the location parameter μ and the scale
parameter σ simultaneously, i.e., θ = (μ, σ). In this case, Jeffreys additionally
requires independence between μ and σ, so that

π(θ) = π(μ)π(σ). (3.39)


60 CHOICE OF PRIOR
Example 3.23 (Location–scale model)
Let us consider a one dimensional sample space and data generating distribu-
tions Pθ , θ = (μ, σ) with density
1 x−μ
p(x|θ) = f( ), (3.40)
σ σ
∞
where f (x) ≥ 0, −∞ f (x)dx = 1. The unknown parameter θ = (μ, σ) consists
of the location parameter μ ∈ R and the scale parameter σ > 0. Under the
independence assumption (3.39), the joint Jeffreys prior is
1 1
π(θ) = π(μ)π(σ) ∝ const ∝ .
σ σ
2

We further illustrate the influence of the independence assumption (3.39) for


the normal distribution.

Example 3.24 (Normal distribution)


Consider the location scale model with
1 x2
f (x) = ϕ(x) = √ exp(− ),
2π 2
i.e., the statistical model is the family of one dimensional normal distributions
N(μ, σ 2 ) where the location parameter μ is the expectation and the scale
parameter σ is the standard deviation: θ = (μ, σ). From
1 1
ln(p(x|θ)) = − ln(2π) − ln(σ) − 2 (x − μ)2 ,
2 2σ
we get the score function
1 1 1
V (μ, σ|x) = ( (x − μ), − + 3 (x − μ)2 )T ,
σ2 σ σ
and the components of the Fisher information
 
1 1
I(μ, σ)11 = Var (X − μ) = 2,
σ2 σ
 
1 1
I(μ, σ)12 = Cov (X − μ), (X − μ) 2
= 0,
σ2 σ3
 
1 1 2
I(μ, σ)22 = Var (X − μ) 2
= 6 (E(X − μ)4 − σ 4 ) = 2 ,
σ3 σ σ
so that ⎛ ⎞
1
0
I (μ, σ) = ⎝ σ2 ⎠.
2
0 σ2
NON-INFORMATIVE PRIORS 61
2
Changing the parametrization to mean and variance η = (μ, σ ), we have the
Jacobian ⎛ ⎞
1 0
J=⎝ ⎠.
0 − 2σ1

 
Applying I μ, σ 2 = JT I (μ, σ) J, we get
⎛ ⎞
1
0
I(μ, σ 2 ) = ⎝ σ ⎠.
2
(3.41)
0 2σ1 4

Without the independence condition (3.39), Jeffreys prior is


1 1
π(μ, σ 2 ) ∝ det(I(μ, σ 2 )) 2 ∝ .
σ3
Under (3.39), we have
1
π(μ, σ 2 ) ∝
.
σ2
Using the scale parametrization we have without condition (3.39),
1 1
π(μ, σ) ∝ det(I(μ, σ)) 2 ∝ .
σ2
This prior is also called left Haar measure. Under (3.39) it is
1
π(μ, σ) ∝ ,
σ
known as right Haar measure. 2

Example 3.25 (Multinomial distribution)


Consider the multinomial model with k cells. The data (n1 , . . . , nk ) are the
observed frequencies  in each cell. The parameter
k consists of the probabilities
k
pi of each cell. Since i=1 pi = 1 and n = i=1 ni , we set θ = (p1 , . . . , pk−1 )
and x = (n1 , . . . , nk−1 ). The probability function is given by


k−1
p(x|θ) ∝ pni i (1 − δ)n−s (3.42)
i=1

k−1 k−1
where δ = i=1 pi and s = i=1 ni . The log–likelihood function is


k−1
ln(p((n1 , . . . , nk−1 )|θ)) = ni ln(pi ) + (n − s) ln(1 − δ) + const
i=1
62 CHOICE OF PRIOR
with the first and second derivatives for i, j = 1, . . . , k − 1
∂ 1 1
ln(p(x|θ)) = ni − (n − s)
∂pi pi 1−δ
∂ 1 1
ln(p(x|θ)) = −ni 2 − (n − s) (3.43)
∂pi pi pi (1 − δ)2
∂ 1
ln(p(x|θ)) = −(n − s) , j = i.
∂pi pj (1 − δ)2

It holds that Eni = npi and E(n − s) = n(1 − δ). Using pk = 1 − δ the Fisher
information is
⎛ ⎞
1 1 1 1
+ . . .
⎜ p1 pk pk pk ⎟
⎜ ⎟
I(θ) = n ⎜ 1 1
p2 + pk
1
... 1
⎟ (3.44)
⎝ pk pk

.. ..
. ... . p 1 + p1
k−1 k

or, alternatively written as


1 1 1
I(θ) = n diag( ,..., ) + n 1k−1 1Tk−1
p1 pk−1 pk

where 1k−1 is the column vector consisting of k − 1 ones, and 1k−1 1Tk−1 is the
(k − 1) × (k − 1) matrix consisting of ones. Using the rule

det(A + aaT ) = det(A)(1 + aT A−1 a)

we obtain

k−1
1  1
k−1 k
1
det(I(θ)) = nk−1 (1 + pi ) = nk−1 .
i=1
pi i=1
p k p
i=1 i

k
Hence Jeffreys prior for p1 , . . . , pk with i=1 pi = 1 is
− 12 − 12
πJeff (p1 , . . . , pk ) ∝ p1 . . . pk (3.45)

which is the Dirichlet distribution Dirk (0.5, . . . , 0.5); see (3.15). 2

In the following table, we summarize the Jeffreys priors for some commonly
used distributions.
NON-INFORMATIVE PRIORS 63

Distribution Jeffreys prior Posterior


p(x|θ) π(θ) ∝ π(θ|x)
Normal Normal
2
N(θ, σ ) const N(x, σ 2 )
Poisson Gamma
Poi(θ) √1 Gamma(x + 0.5, 1)
θ
Gamma Gamma
Gamma(ν, θ) √1 Gamma(ν, x)
θ
Binomial Beta Beta
Bin(n, θ) Beta(0.5, 0.5) Beta(0.5 + x, 0.5 + n − x)
Negative Binomial Beta
− 12 − 12
NB(m, θ) θ (1 − θ) Beta(m, x − 1), for x > 1
Normal InvGamma
2
N (μ, θ) 1
θ InvGamma(1, (μ−x)
2 )

3.3.3 Reference Priors


Reference priors were introduced by Bernardo in Bernardo (1979). They be-
came more and more popular, and the name Reference Analysis is used for a
Bayesian inference based on reference priors, see Bernardo (2005). In Berger
and Bernardo (1992a), an algorithm for the construction of reference priors is
given, also named Berger–Bernardo method, see Kass and Wasserman (1996).
A formal definition of a reference prior was first given in Berger et al. (2009).
Bernardo’s main contribution was to define the concept of non-information in
a mathematical way. To describe his approach, consider the Kullback–Leibler
divergence   
π(θ|x)
K(π(.|x)|π) = π(θ|x) ln dθ, (3.46)
π(θ)
which describes the gain of information using the posterior distribution π(.|x)
instead of the prior π. It is the information coming from the experiment. To
make it independent of the data x, Lindley (1956) proposed the expectation
of the Kullback–Leibler divergence as expected information. It depends on the
statistical model P = {Pθ : θ ∈ Θ} and on the prior π,
 
I(P, π) = p(x)K(π(.|x)|π)dx, where p(x) = p(x, θ)dθ. (3.47)
64 CHOICE OF PRIOR
The expected information is equal to the mutual information, which measures
the gain of information using p(x, θ) instead of p(x)π(θ). It holds that p(x, θ) =
p(x)π(θ|x) = π(θ)p(x|θ) and
  
p(x, θ)
I(P, π) = p(x, θ) ln dθ dx.
p(x)π(θ)

Now the argument is that a non-informative prior should have almost no


influence, which in turn means that the information from the experiment
should be maximal. Define the Shannon entropy of an arbitrary continuous
distribution P with density p(x) by

H(P) = − p(x) ln(p(x))dx, (3.48)

recall that in (3.25) the entropy is defined for the discrete case. The expected
information can be presented as the difference of the entropy of the prior
and the expected entropy of the posterior. By using (3.46) and π(θ|x)p(x) =
π(θ)p(x|θ) we get
   
π(θ|x)
I(P, π) = p(x) π(θ|x) ln dθ dx
π(θ)
 
=− π(θ|x)p(x) ln(π(θ))dx dθ + p(x)π(θ|x) ln(π(θ|x))dx dθ
  
= − π(θ) ln(π(θ)) p(x|θ)dx dθ + p(x)π(θ|x) ln(π(θ|x))dx dθ
  
= − π(θ) ln(π(θ))dθ + p(x) π(θ|x) ln(π(θ|x))dθ dx

We have 
I(P, π) = H(π) − p(x)H(π(.|x)dx. (3.49)

Thus a prior which generates a large expected information I(P, π) corresponds


to a prior with large entropy and related posterior with small expected entropy.
For illustration we consider the formulae in case of a binomial distribution with
a beta distribution as prior.

Example 3.26 (Binomial distribution)


We consider X|θ ∼ Bin(n, θ) and θ ∼ Beta(α, β). Then the posterior is
Beta(α + x, β + n − x), see Example 2.9. The Kullback–Leibler divergence
from the prior to the posterior is the divergence between the two beta
NON-INFORMATIVE PRIORS 65
distributions, i.e.,

K(Beta(α + x, β + n − x)|Beta(α, β))


=
 
B(α, β)
ln
B(α + x, β + n − x)
+ xΨ(α + x) + (n − x) Ψ(β + n − x) − n Ψ(α + β + n),

where Ψ is the digamma function


 1
1 − xα−1
dx = Ψ(α) − Ψ(1),
0 1−x

or alternatively
d
Ψ(x) = ln(Γ(x)).
dx
The marginal distribution of X is calculated in (3.5). For illustration let us
consider as candidate class C of prior distributions on Θ all symmetric beta
priors with α = β. Then the marginal distribution in (3.5) fulfills p(k) =
p(n − k) and the expected information can be written as function of α as

I(α) = ln(B(α, α)) − Ex ln(B(α + x, α + n − x) + 2Ex xΨ(α + x) − nΨ(2α + n).


(3.50)
In Figure 3.15 the information I(α) is plotted, using the beta and digamma
functions defined in R. Let us compare the prior Beta(αp , αp ) with αp =
arg min I(α) to the prior with maximal entropy. The entropy of Beta(α, β)
distribution is given by

H(Beta(α, β)) = ln(B(α, β)) − (α − 1)Ψ(α) − (β − 1)Ψ(β) + (α + β − 2)Ψ(α + β)

H(Beta(α, β)) is non-positive for all α, β with maximal value zero attained
for α = 1, β = 1, where B(1, 1) = 1. The prior Beta(1, 1) is equal to the
uniform distribution U[0, 1], which is also the Laplace non-informative prior;
see Example 3.20. 2

Unfortunately the requirement to take the prior which maximizes the expected
information I(P, π) does not give tractable results in all cases (see the discus-
sion in Berger and Bernardo, 1992a). Bernardo (1979) proposed to compare
the prior information with the theoretical best posterior information, which
he described as the limit information for n → ∞ from an experiment with n
repeated independent copies of the original experiment P n = {P⊗n θ : θ ∈ Θ}.
66 CHOICE OF PRIOR

3.5

n=1000

3.0
2.5
● n=200
Information

2.0
1.5

● n=20
1.0


n=5
0.5
0.0

0.0 0.2 0.4 0.6 0.8 1.0 1.2

alpha

Figure 3.15: Example 3.27. The expected information I(α) in (3.50) has its maximum
at α = 0.142 for n = 5, α = 0.272 for n = 20, α = 0.41 for n = 200 and α = 0.45 for
n = 1000.

Example 3.27 (Binomial distribution)


⊗n
n Consider P = {Bin(m, θ) : θ ∈ [0, 1]}. A
n
We continue with Example 3.26.
sufficient statistic is tn (X) = i=1 Xi , with tn (X) ∼ Bin(n m, θ). Thus we can
use the formulae in Example 3.26 with increasing n. In Figure 3.15 the infor-
mation I(α) is plotted for different n. We see the symmetric non-informative
prior depends on n and approaches Beta(0.5, 0.5), which is Jeffreys prior; see
Example 3.20. 2

In Clarke and Barron (1994) the authors derive an asymptotic expansion of


the mutual information I(P, π) under regularity conditions on the statistical
model P. For all positive continuous priors π, supported on a compact subset
K, it holds that
p n
I(P, π) = ln + ln c − K(π|π ∗ ) + rest(n), (3.51)
2 2πe
where π ∗ is Jeffreys prior given by
 
∗ 1
π (θ) = det I(θ), with c = det I(θ)dθ, K ⊂ Θ ⊂ Rp compact.
c K
NON-INFORMATIVE PRIORS 67
The remainder terms fulfills limn→∞ rest(n) = 0. The regularity conditions
on the statistical model P include the existence and positive–definiteness of
the Fisher information I(θ) and that
 
∂2
I(θ) = K(Pθ |Pθ )|θ=θ .
∂θi ∂θj
i=1,...,p,j=1,...,p

The first term in (3.51), p2 ln 2πe


n
, is the entropy of Np (0, Ip ). The derivation
of (3.51) is based on similar background as that of Jeffreys prior, see (3.30).
The Kullback–Leibler divergence K(P|Q) is non–negative and zero only for
P = Q. Thus under regularity conditions the Jeffreys prior is the reference
prior.
The formal definition of Berger et al. (2009) can be presented as follows.

Definition 3.5 (Reference Prior) A function π(θ) depending on the


statistical model P = {Pθ : θ ∈ Θ} and a class of prior distributions C
on Θ is a reference prior iff

1. For all x ∈ X , Θ π(θ)p(x|θ)dθ < ∞.
2. There exists an increasing sequence of compact subsets {Θj }j=1,...,∞
such that ∪∞ j=1 Θj = Θ, and the posterior πj (θ|x) defined on Θj fulfills
for all x ∈ X
lim K(π(.|x)|πj (.|x)) = 0,
j→∞

where π(θ|x) ∝ π(θ)p(x|θ).


3. For any compact set Θ0 ⊆ Θ, and for any p ∈ C let π0 and p0 be the
truncated distributions of π and p defined on Θ0 , such that

lim (I(P n , π0 ) − I(P n , p0 )) ≥ 0.


n→∞

Definition 3.5 is formulated in as general term as possible. Condition 1 in-


cludes improper priors, but the related posterior has to be proper. Condition
2 involves a construction of the reference prior as limit of priors on compact
subsets; also this condition is a tool for handling improper priors. Condition
3 requires that asymptotically the reference prior should influence the infor-
mation of the experiment as less as possible.

Definition 3.5 does not provide a construction of reference priors. The follow-
ing theorem from Berger et al. (2009) provides an explicit expression.
68 CHOICE OF PRIOR

Theorem 3.5
Assume P = {Pθ : θ ∈ Θ ⊆ R} and a class of prior distributions C on
Θ. For all p ∈ C and all θ ∈ Θ, p(θ) > 0 and Θ p(θ)p(x|θ)dθ < ∞. Let
P n = {P⊗nθ : θ ∈ Θ} and tn be a sufficient statistic of P n . Let π ∗ (θ) be
a strictly positive function such that the posterior π ∗ (θ|tn ) is proper and
asymptotically consistent. For any θ0 in an open subset of Θ, define
 
fn (θ)
fn (θ) = exp p(tn |θ) ln(π ∗ (θ|tn )) dtn , f (θ) = lim . (3.52)
n→∞ fn (θ0 )

If
1. for all n, fn (θ) is continuous,
2. the ratio ffnn(θ
(θ)
0)
is either monotonic in n or bounded by an integrable
function and
3. conditions 1 and 2 of Definition 3.5 are fulfilled,
then f (θ) is a reference prior.

Proof: See (Berger et al., 2009, Appendix F).


2
The condition that the posterior π ∗ (θ|tn ) has to be asymptotically consistent
is defined in Chapter 5. The following example illustrates an application of
Theorem 3.5.

Example 3.28 (Uniform distribution)


Consider an i.i.d. sample from U[θ − 12 , θ + 12 ], θ ∈ R. It is a location
model, but it does not fulfill the regularity conditions, because the sup-
port of the distribution depends on the parameter. The likelihood function is
(θ|x1 , . . . , xn ) = I[xmax −0.5,xmin +0.5] (θ), with sufficient statistic t = (t1 , t2 ) =
(xmin , xmax ), where xmin = min(x1 , . . . , xn ) and xmax = max(x1 , . . . , xn ).
Applying Theorem 3.5, we start with π ∗ = 1. Then the related posterior
π ∗ (θ|t) ∝ I[t2 −0.5,t1 +0.5] (θ) is the uniform distribution U[t2 − 0.5, t1 + 0.5], i.e.,
1
π ∗ (θ|t) = I[t −0.5,t1 +0.5] (θ).
1 + t 2 − t1 2
The statistic r = t2 − t1 is the range of the distribution which is location–
invariant. In this case we have xmax = umax − θ + 0.5 and xmin = umin −
θ + 0.5 where umax is the maximum and umin is the minimum of inde-
pendent identically U[0, 1] distributed random variables. Hence fn (θ) =
1 1
exp 0 p(r) ln( 1+r ) dr is independent of θ and the ratio ffnn(θ (θ)
0)
= 1. We
obtain the reference prior π(θ) = 1, which is the Laplace prior. Note that, the
prior is improper, but the posterior U[xmax − 0.5, xmin + 0.5] is proper. 2
NON-INFORMATIVE PRIORS 69
The argument in Example 3.28 can be applied to all location models. In Exam-
ple 3.21, the regular location model was studied with the result that Jeffreys
prior equals Laplace prior.

For cases where the integrals and the limit require complex calculations,
Berger et al. (2009) proposed the following algorithm for the numerical com-
putation of the reference prior. The algorithm includes Monte Carlo steps for
the integrals in (3.52) and an approximation of the limit, see also Chapter 9.

Algorithm 3.1 Reference prior


1. Starting Values:
(a) Choose moderate value k (simulate the limit in (3.52)).
(b) Choose an arbitrary positive function π ∗ (θ) (take π ∗ (θ) = 1).
(c) Choose moderate value m for the Monte Carlo steps.
(d) Choose θ values for which (θ, π(θ)) is required, θ(1) , . . . , θ(M ) .
(e) Choose an arbitrary interior point θ0 ∈ Θ.
2. For each θ ∈ {θ0 , θ1 , . . . , θM }:
(a) For each j with j = 1, . . . , m
i. Draw independently xj,1 , . . . , xj,k from p(·|θ).
ii. Compute numerically:
 
k
cj = p(xj,i |θ)π ∗ (θ)dθ
Θ i=1

A. Draw independently θ1 , . . . , θm from π ∗ (·).


k
B. Calculate ph = i=1 p(x j,i |θh ) for h = 1, . . . , m.
1 m
C. Approximate cj by m h=1 ph .
iii. Calculate
1 
k
rj (θ) = ln( p(xj,i |θ)π ∗ (θ)).
cj i=1
1
m
(b) Compute fk (θ) = exp( m j=1 rj (θ)).
fk (θ)
3. Store (θ, π(θ)) with π(θ) = fk (θ0 ) .

The other innovation in Bernardo (1979), besides the information–theoretic


foundation of a non-informative prior, is his stepwise procedure for handling
nuisance parameters. In the following we present this procedure and give an
illustrative example.

Suppose the parameter


θ = (ω, λ)
70 CHOICE OF PRIOR
consists of a parameter of interest ω and a nuisance parameter λ. The main
idea is to derive a reference prior for the nuisance parameter given the pa-
rameter of interest, π(λ|ω), and then to eliminate the nuisance parameter by
considering the marginal model,

p(x|ω) = p(x|ω, λ)π(λ|ω)dλ. (3.53)
Λ

Algorithm 3.2 Reference prior with nuisance parameter

1. Split the parameter θ = (ω, λ) into the parameter of interest ω and the
nuisance parameter λ.
2. For each parameter of interest ω, derive the reference prior π(λ|ω) related
to the statistical model {p(x|ω, λ) : λ ∈ Λ}.
3. Use π(λ|ω) to eliminate the nuisance parameter by

p(x|ω) = p(x|ω, λ)π(λ|ω) dλ.
Λ

4. Calculate the reference prior π(ω) related to the statistical model


{p(x|ω) : ω ∈ Ω}.
5. Propose as reference prior for the whole parameter

π(θ) = π(λ|ω)π(ω).

Note that, in general an exchange of parameter of interest with the nuisance


parameter delivers a different reference prior. Under regularity conditions, this
procedure describes a stepwise calculation of Jeffreys prior. In the following
example, taken from Polson and Wasserman (1990), we illustrate this method
and compare the reference prior with the joint Jeffreys prior.

Example 3.29 (Bivariate Binomial)


Consider the following bivariate binomial model

X1 ∼ Bin(n, p) and X2 |X1 ∼ Bin(X1 , q).

The observations are x = (x1 , x2 ). The parameter has two components θ =


(p, q). It holds that
   
n x1 x1 x2
p(x|θ) = p (1 − p)n−x1 q (1 − q)x1 −x2 . (3.54)
x1 x2

The parameter of interest is ω = p and the nuisance parameter is λ = q. We


NON-INFORMATIVE PRIORS 71
have
{p(x|ω, λ) : λ ∈ Λ} = {Bin(X1 , q) : q ∈ (0, 1)}.
The binomial model is a regular model, and the reference prior coincides with
Jeffreys prior, see Example 3.20. We obtain π(λ|ω) as the beta distribution
Beta(0.5, 0.5), which is independent of ω = p. Then the marginal model (3.53)
is calculated as
p(x|p)
 1   
n x1 x1 x2 1
q (1 − q)x1 −x2 − 12
(1 − q)− 2 dq
1
= p (1 − p)n−x1 1 1 q
0 x 1 x 2 B( ,
2 2 )
   
n x1 x 1 B(x 2 + 1
, x 1 − x 2 + 1
)
= p (1 − p)n−x1 2 2
,
x1 x2 B( 12 , 12 )
(3.55)

where B(a, b) is the beta function. This model is regular. Using


p
ln(p(x|p)) = x1 ln( ) + n ln(1 − p) + const
1−p
we obtain the Fisher information
n
I(p) = .
p(1 − p)
Thus Jeffreys prior is the beta distribution, Beta(0.5, 0.5). Summarizing, we
have the reference prior

π(θ) = π(p)π(q) ∝ p− 2 (1 − p)− 2 q − 2 (1 − q)− 2 ,


1 1 1 1

where π(p) and π(q) are the beta distributions, Beta(0.5, 0.5). Let us now
compare the reference prior with the Jeffrey prior for θ = (p, q). Using (3.54)
we have the score function
 T
x1 − np x2 − x1 q
V (θ|x) = , = (V1 (p), V2 (q))T .
p(1 − p) q(1 − q)
and the second derivatives
d x1 (2p − 1) − np2 d x2 (2q − 1) − x1 q 2
V1 (p) = and V2 (q) = ,
dp p (1 − p)
2 2 dq q 2 (1 − q)2
d d
where dq V1 (p) and dp V2 (q) are zero. Further,

EX1 = np and EX2 = E(E(X2 |X1 )) = E(X1 q) = npq.

Applying (3.32) we get


⎛ ⎞
1
0
I(θ) = n ⎝ p(1−p) ⎠,
p
0 q(1−q)
72 CHOICE OF PRIOR
and
πJeff (θ) ∝ det(I(θ)) 2 ∝ p− 2 (1 − p)− 2 p 2 q − 2 (1 − q)− 2 ,
1 1 1 1 1 1

i.e., the Jeffreys prior is

πJeff (θ) ∝ (1 − p)− 2 q − 2 (1 − q)− 2 .


1 1 1

Note that, it is not a product of two beta distributions, but it is still a proper
prior. 2

Berger and Bernardo (1992a) proposed an iterative algorithm which is now


named Berger–Bernado method in the literature. We give the main steps and
an illustrative example. The p-dimensional parameter is separated in m ≤ p
groups. It is recommended to do it in the order of interest. The first group
includes the parameters of main interest. The idea is to iteratively eliminate
the parameter groups as in the method above. We introduce the following
notations.
θ = (θ1 , . . . , θp ) = (θ(1) , . . . , θ(m) ) = (θ[j] , θ[∼j] ),
θ[j] = (θ(1) , . . . , θ(j) ) and θ[∼j] = (θ(j+1) , . . . , θ(m) ) (3.56)
j = 1, . . . , m, θ[0] = 1, θ[∼0] = θ.

Algorithm 3.3 Berger–Bernardo method

1. For j = m, m − 1, . . . , 1:
(a) Suppose the current state is π(j+1) (θ[∼j] |θ[j] ).
(b) Determine the marginal model

p(x|θ[j] ) = p(x|θ)π(j+1) (θ[∼j] |θ[j] ) dθ[∼j] .

(c) Determine the reference prior h(j) (θ(j) |θ[j−1] ) related to the model

{p(x|θ[j] ) : θ(j) ∈ Θj },

where the parameters θ[j−1] are considered as given.


(d) Compute π(j) (θ[∼(j−1)] |θ[j−1] ) by

π(j) (θ[∼(j−1)] |θ[j−1] ) ∝ π(j+1) (θ[∼j] |θ[j] )h(j) (θ(j) |θ[j−1] ).

2. Take π(θ) := π(1) (θ[∼0] |θ[0] ), as reference prior.


NON-INFORMATIVE PRIORS 73
The following example is an illustration of the Berger–Bernardo method. It is
taken from Berger and Bernardo (1992b).

Example 3.30 (Multinomial distribution)


For simplicity we set k = 4 and use groups with single parameters. The multi-
nomial distribution Mult(n, p1 , . . . , p4 ) is a discrete distribution with proba-
bility
n!
P(n1 , n2 , n3 , n4 ) = pn1 . . . pn4 4 , (3.57)
n1 ! . . . n4 ! 1
where n1 + n2 + n3 + n4 = n and p1 + p2 + p3 + p4 = 1, thus p4 = 1 − δ
with δ = p1 + p2 + p3 . We have x = (n1 , n2 , n3 ) and θ = (p1 , p2 , p3 ), with
θ(1) = p1 , θ(2) = p2 , θ(3) = p3 . The first step is to calculate the conditional
reference prior for p3 given p1 , p2 . This means we derive the reference prior
for the model

{p(x|θ) : θ1 = p1 , θ2 = p2 , θ3 ∈ (0, 1 − (p1 + p2 ))},

where p(x|θ) is given in (3.57). This conditional model is regular. The reference
prior is Jeffreys prior, given in (3.45), constrained on (0, 1 − (p1 + p2 )),
−1
π(p3 |p1 , p2 ) ∝ p3 2 (1 − δ)− 2 , 0 < p3 < 1 − (p1 + p2 ).
1

In order to get the constant


 1−(p1 +p2 )
−1
p3 2 (1 − δ)− 2 dp3
1
c=
0

we apply the integral


 1−d
(1 − d − x)a xb dx = (1 − d)a+b+1 B(a + 1, b + 1), a > −1, b > −1, (3.58)
0
1
where B(a + 1, b + 1) = 0 xa (1 − x)b dx is the beta function. Note that, we
x
obtain the integral (3.58) by changing the variables to y = 1−d and from the
definition of the beta function. The constant c = B(0.5, 0.5) is independent
on p1 , p2 . Thus the constrained conditional Jeffreys prior is
−1
π(p3 |p1 , p2 ) = B(0.5, 0.5)−1 p3 2 (1 − δ)− 2 .
1

The second step is to determine the marginal model. We have to calculate



p(x|θ1 , θ2 ) = p(x|θ)π(θ3 |θ1 , θ2 ) dθ3 .

It holds that
 1−(p1 +p2
n − 12
(1 − δ)n4 − 2 p3 3
1
p(x|θ1 , θ2 ) ∝ pn1 1 pn2 2 dp3 .
0
74 CHOICE OF PRIOR
Using (3.58) again, the marginal model is

p(x|θ1 , θ2 ) ∝ pn1 1 pn2 2 (1 − (p1 + p2 ))(n−n1 −n2 ) ,

which is the multinomial distribution Mult(n, p1 , p2 , 1 − (p1 + p2 )). We get the


conditional reference prior for p2 given p1 from (3.45), as
−1
π(p2 |p1 ) ∝ p2 2 (1 − p1 − p2 )− 2 , 0 < p2 < 1 − p1 .
1

The marginal model, after eliminating p2 , is

p(x|θ1 ) ∝ pn1 1 (1 − p1 )n−n1 ,

which is the binomial distribution, Bin(n, p1 ), and the reference prior is Jeffreys
prior
−1
π(θ1 ) ∝ p1 2 (1 − p1 )− 2 .
1

This gives
−1 −1 −1
π(p1 , p2 , p3 ) ∝ p1 2 (1 − p1 )− 2 p2 2 (1 − (p1 + p2 ))− 2 p3 2 (1 − (p1 + p2 + p3 ))− 2 .
1 1 1

3.4 List of Problems


1. Consider the following statistical model on customer satisfaction. In a
query, the customer can rate between satisfied (+), disappointed (-), ok
(+/-), no answer (0). The parameter θ is the level of satisfaction 0, 1, 2,
where 2 means the customer is satisfied. The probability Pθ (x) is given in
the following table:

x + +/− − 0
θ=2 0.6 0.1 0 0.3
θ=1 0.1 0.2 0.1 0.6
θ=0 0 0.2 0.79 0.01

From earlier studies it is known that the probability of high satisfaction


(θ = 2) is 0.2 and the probability of no satisfaction (θ = 0) is 0.3. Consider
the result of a single query.
(a) Calculate the posterior distribution, using the information from earlier
studies.
(b) Determine the prior with highest entropy and calculate the correspond-
ing posterior distribution.
(c) Determine the prior with π(0) = 2π(1) and highest entropy.
LIST OF PROBLEMS 75
(d) Determine the prior with π(0) = π(1) and prior expectation 1.
2. Consider an i.i.d. sample from a Lognormal(μ, σ 2 ) distribution with density
1 1
f (x|θ) = √ exp(− 2 (ln(x) − μ)2 ), −∞ < μ < ∞, σ > 0, x > 0.
σx 2π 2σ
Set σ = 1. The parameter of interest is θ = μ. We guess that the parameter
μ is lying symmetrically around 3. We have only vague knowledge and
assume as prior a Cauchy distribution C(m, γ). The hyperparameters are
m, γ.
(a) Determine the location parameter m of the prior.
(b) Further require that the prior probability P(μ > 10) > 0.3. Determine
the hyperparameter γ.
(c) Is the prior conjugate?
(d) Derive Jeffreys prior.
3. Consider an i.i.d. sample X1 , . . . , Xn from a Pareto distribution P(α, μ),
with
μα
f (x | α, μ) = α α+1 I[μ,∞) (x) , α > 0, μ ∈ R
x
(a) Set θ = (α, μ). Does the distribution belong to an exponential family?
(b) Set α = 1. Does the distribution belong to an exponential family?
(c) Set μ = 1. Does the distribution belong to an exponential family?
(d) Set μ = 1. Derive a conjugate prior for α.
4. Consider an i.i.d sample X1 , . . . , Xn from a geometric distribution Geo(θ),

Pθ (k) = (1 − θ)k θ.

(a) Does the sample distribution belong to an exponential family? Determine


the sufficient statistic and the natural parameter.
(b) Apply Theorem 3.3 to derive a conjugate family.
(Hint: Y = g(X), fY (y) = fX (g −1 (y)) | dy
d −1
g (y) |.)
(c) Which family of distributions is this conjugate family?
(d) Give the conjugate posterior distribution π(θ | x1 , . . . , xn ).
(e) Derive the Fisher information.
(f) Derive Jeffreys prior.
(g) Does Jeffreys prior belong to the conjugate family?
5. Consider an i.i.d. sample X = (X1 , . . . , Xn ) from Gamma(α, β). The pa-
rameter of interest is θ = (α, β).
(a) Is the distribution of the sample a member of an exponential family?
Determine the natural parameters and the statistics T (x).
(b) Determine a conjugate prior for θ.
(c) Determine the corresponding posterior.
76 CHOICE OF PRIOR
(d) Is the conjugate family a known distribution family? Is the conjugate
family an exponential family?
6. Consider X|θ ∼ Bin(n, θ). An alternative parameter η is the odds ratio
θ
η = 1−θ .
(a) Consider the prior π(η) ∝ η −1 for the odds ratio and derive the prior for
the success probability θ.
(b) Is the prior for θ in (a) proper? Is the related posterior proper?
(c) Set as prior for θ the beta distribution Beta(a, b). Derive the prior for
ξ = ab η. Is it a well-known distribution?
7. Suppose X|θ ∼ Bin(n, θ).
(a) Is the distribution of X a member of an exponential family? Determine
the natural parameter.
(b) Determine a conjugate prior for the natural parameter.
(c) Determine the corresponding posterior.
(d) Is the conjugate family a known distribution family?
8. Consider the multinomial distribution Mult(n, p1 , . . . , p3 ) given in (3.12)
with k = 3. We are interested in the parameters η1 = pp13 and η2 = pp23 .
Derive Jeffreys prior πJeff (η).
9. The Hardy–Weinberg model states that the genotypes AA, Aa and aa occur
with following probabilities:
pθ (aa) = θ2 , pθ (Aa) = 2θ(1 − θ), pθ (AA) = (1 − θ)2 , (3.59)
where θ is an unknown parameter in Θ = (0, 1).
(a) Does the distribution in (3.59) belong to an exponential family?
(b) Determine the sufficient statistics and the natural parameter η.
(c) Derive a conjugate family for η.
(d) Determine Jeffreys prior for η.
(e) Does the posterior related to Jeffreys prior belong to that conjugate
family?
10. Consider X ∼ N(θ, 1).
(a) Recall Jeffreys prior and the related posterior πJeff (θ|x).
k
(b) Derive Jeffreys prior πJeff k
and the related posterior πJeff (θ|x) for θ ∈
[−k, k].
(c) Show that for all x ∈ R the Kullback–Leibler divergence converges:
k
lim K(πJeff (.|x)|πJeff (.|x)) = 0.
k→∞

(d) Determine the reference prior p0 (θ) by


k
πJeff (θ)
p0 (θ) = lim , θ0 ∈ [−k, k].
k→∞ π k (θ0 )
Jeff

Does the reference prior depend on the choice of θ0 ?


LIST OF PROBLEMS 77
11. Consider X ∼ N(μ, σ ) with density ϕμ,σ2 (x). Calculate the Shannon
2

entropy H(N(μ, σ 2 )) = − ϕμ,σ2 (x) ln(ϕμ,σ2 (x))dx.


12. Consider an i.i.d. sample from X ∼ N(θ, σ 2 ) and two priors π1: N(0, σ02 )
and π2: N(μ0 , λσ02 ), λ > 0.
(a) Compute the related posteriors πj (θ|t(x)), j = 1, 2, where t(x) is the
sufficient statistic.
(b) Calculate expected information I(P n , π1 ) and I(P n , π2 ).
(c) Calculate
lim (I(P n , π1 ) − I(P n , π2 )) .
n→∞

(d) For which λ and μ0 is the second prior less informative than the first?
Chapter 4

Decision Theory

This chapter provides an introduction to decision theory, especially the


Bayesian decision theory. The main goal is to give a deeper understanding
of Bayesian inference. Readers who are mainly interested in methods and ap-
plications, can skip this chapter.
We want to explain the connection between Bayesian inference based on the
model {P, π} and frequentist inference based on the model {P}.
Bayes methods are based on the posterior distribution, but nevertheless it
makes sense to apply them outside of the Bayes model. Then the question
arises: Which optimal properties the Bayes method can have? One answer is:
For a special choice of worst case prior π0 , the Bayes method can be minimax
optimal.
Otherwise we can also express frequentist methods as Bayesian and derive
optimal properties for them.
This chapter is mainly based on Robert (2001, Chapter 2) and Liese and
Miescke (2008, Chapter 3).

4.1 Basics of Decision Theory


We begin with the main ingredients. First, we consider the statistical model
defined in (2.1),
P = {Pθ : θ ∈ Θ},
where θ = (θ1 , . . . , θp ). Second, we introduce a decision space denoted by
D. In case we want to find out the underlying parameter from data x we set
D = Θ ⊆ Rp . When we are only interested in θ1 , the first component of θ, then
we take D = Θ(1) ⊆ R. We may also be interested to predict a future data
point x ∈ Xf . Then we set D = Xf . For testing problem the decision space is
D = {0, 1}; for classification problem or model choice, it is D = {1, . . . , k}.

An element d ∈ D is called decision. The main purpose of statistical inference


is to make a decision based on the data x. Formulated as decision rule, it is
defined as
δ : x ∈ X → δ(x) = d ∈ D.

DOI: 10.1201/9781003221623-4 78
BASICS OF DECISION THEORY 79

Figure 4.1: The Glienicker Bridge as Laplace–Bayes bridge.

Depending on the inference problem it can be an estimator, a predictor, a test


or a classification rule. The third ingredient is the evaluation of a decision,
which is determined by a loss function, defined as
L : (θ, d) ∈ Θ × D → L(θ, d) ∈ R+ . (4.1)
The loss L(θ, d) is the penalty, when θ is the true parameter and the decision is
d. A high loss means a bad decision. In general a loss can be defined arbitrarily,
but it must satisfy some reasonable properties. For instance, for Θ = D = R,
L(θ, d) = L(d, θ)
(4.2)
L(θ, d) is increasing if |θ − d| is increasing,
see Figure 4.2. Important loss functions are the Laplace or L1 loss
L1 (θ, d) = |θ − d|
and the Gaussian or L2 loss
L2 (θ, d) = |θ − d|2 .
We illustrate the meaning of a loss function with help of the toy example on
lion’s appetite introduced in Example 2.4.
80 DECISION THEORY

20
L1
L2

15
absolute error
loss

10
5
0

−4 −2 0 2 4

theta−d

Figure 4.2: Illustration of different loss functions.

Example 4.1 (Lion’s appetite) The appetite of a lion has three different
stages:
θ ∈ Θ = {hungry, moderate, lethargic} = {θ1 , θ2 , θ3 }.
It would be much more dangerous to classify a hungry lion as lethargic than
the other way around. We consider the following losses:

L(θ, d) d = θ1 d = θ2 d = θ3
θ1 0 0.8 1
θ2 0.3 0 0.8
θ3 0.3 0.1 0
2

Applying the loss on a decision rule gives L(θ, δ(x)) which is data dependent.
The quality of a decision rule should not depend on the observed data, but on
the data generating probability distribution. We have to consider the loss as
a random variable and to study the distribution of L(θ, δ(X)) under X ∼ Pθ .
In decision theory we are mainly interested in the expectation. We define the
risk as the expected loss

R(θ, δ) = L(θ, δ(x)) p(x|θ) dx. (4.3)
X
BASICS OF DECISION THEORY 81
The risk in (4.3) is often called frequentist risk, because it is based on the
statistical model P only. We consider decision rule δ1 better than decision rule
δ2 iff
R(θ, δ1 ) ≤ R(θ, δ2 ) for all θ ∈ Θ.
Since R(θ, δ1 ) should be compared over the entire parameter space, there are
risks that are not comparible. We illustrate it with the toy example.

Example 4.2 (Lion’s appetite)


Continuing with Example 4.1, we compare the following three decision rules
by their risk.
x 0 1 2 3 4
δ1 θ3 θ3 θ2 θ2 θ1
δ2 θ3 θ2 θ2 θ1 θ1
δ3 θ1 θ1 θ1 θ1 θ1
The first and second rule take the observations into account, the higher x the
more dangerous the lion is. The third rule always decides that the lion was
hungry independent of the observation. We calculate the risk by R(θ, δ) =
5
i=1 L(θ, δ(xi ))Pθ (xi ), where Pθ (xi ) is given in Example 2.4. We obtain

R(θ, δ) θ1 θ2 θ3
δ1 0.73 0.08 0.005
δ2 0.08 0.07 0.01
δ3 0 0.3 0.3

As it seems, none of the rules can be preferred; see Figure 4.3. 2

In general we cannot find an optimal decision rule which is better than all the
other decision rules. The way out is that we search for the rule for which we
cannot find a better one.

Definition 4.1 (Admissible decision) Assume the statistical model P,


the decision space D and the loss function L. A decision rule δ0 is called
inadmissible iff there exists a decision rule δ1 such that

R(θ, δ0 ) ≥ R(θ, δ1 ) for all θ ∈ Θ


(4.4)
R(θ0 , δ0 ) > R(θ0 , δ1 ) for at least one θ0 ∈ Θ.

Otherwise the decision rule δ0 is called admissible.


82 DECISION THEORY

0.8

0.6
risk

0.4

● ●
0.2

● ●

0.0

● ●

1.0 1.5 2.0 2.5 3.0

theta

Figure 4.3: Example 4.2. No decision rule has a smaller risk for all parameters. It is
impossible to compare them by the frequentist risk.

The admissibility property is weaker than the requirement of minimal risk for
all parameters. We have a better chance to find an admissible decision rule
than an optimal one. Already from this type of definition it becomes clear
that it is easier to prove inadmissibility than the opposite. We will see that
most of the proofs in this field are indirect proofs.
We conclude this section with a prominent example on the James–Stein esti-
mator, defined in James and Stein (1960).

Example 4.3 (James–Stein estimator)


Let X1 , . . . , Xd be independent r.v., Xi ∼ N(θi , σ 2 ), where θ = (θ1 , . . . , θd ) ∈
Rd is unknown. The variance σ 2 is known, say σ 2 = 1. Then the unbiased,
natural estimator for θ is Tnat (x) = (x1 , . . . , xd )T . The surprising fact here
is that for d > 2 it is inadmissible. James and Stein (1960) introduced a
shrinkage estimator for d > 2, as
 
d−2
SJS (x) = 1 − x. (4.5)
x 2

For the loss function L(θ, d) = θ − d 2 it can be shown that the James–
Stein estimator is better. We consider main steps of the proof. Comparing the
BAYESIAN DECISION THEORY 83
estimators Tnat (x) = x and SJS we get
D = R(θ, Tnat ) − R(θ, SJS )
d    
d−2 (d − 2)2 (4.6)
=2 Eθ (xj − θj ) xj − Eθ .
j=1
x 2 x 2

Using repeated integration by parts, it can be shown that


   
d−2 (d − 2)2
Eθ (xj − θj ) xj = Eθ .
x 2 x 2
Thus  
(d − 2)2
D = Eθ .
x 2
We need to calculate the expectation of an inverse noncentral χ2d distribu-
2
tion with noncentrality parameter λ = θ2 . We use the representation of a
noncentral χ2d distribution as a Poisson weighted mixture, with N ∼ Poi(λ).
Then     
1 1
Eθ = E λ Eθ |N = k .
x 2 x 2
The conditional distribution of x 2 given N = k is a central χ2d+2k distribu-
1
tion. The expectation of an inverse χ2d+2k distribution is d+2k−2 . We obtain
   
1 1
Eθ = Eλ .
x 2 d + 2N − 2
Applying the Jensen inequality, it holds that
 
1 1 1
Eλ ≥ = .
d+N −2 d + 2Eλ N − 2 d + 2λ − 2
Finally we obtain
(d − 2)2
D≥ > 0.
d+ θ 2−2
2

4.2 Bayesian Decision Theory


Now we consider the Bayes model {P, π}; see Definition 2.3. This avoids the
problem that decision rules are not comparable with respect to the risk func-
tion, given in (4.3). We define the integrated risk as the expectation of
frequentist risk with respect to the prior π

r(π, δ) = Eπ R(θ, δ) = R(θ, δ)π(θ) dθ. (4.7)
Θ
To illustrate the advantages of this Bayesian approach we continue with our
toy example.
84 DECISION THEORY
Example 4.4 (Lion’s appetite)
We continue Example 4.2. In Example 2.4 a subjective prior is set for an adult
lion as (π(θi ))i=1,2,3 = (0.1, 0.1, 0.8). The integrated risk is calculated by

3
r(π, δ) = R(θi , δ)π(θi ).
i=1

We obtain r(π, δ1 ) = 0.085, r(π, δ2 ) = 0.023 and r(π, δ3 ) = 0.27. The second
rule is better than the first rule. The pessimistic and data-independent deci-
sion rule δ3 is the worst. 2

Definition 4.2 (Bayes decision rule) Assume the Bayes model {P, π},
the decision space D and the loss function L. A decision rule δ π is called a
Bayes decision rule iff it minimizes the integrated risk r(π, δ). The value

r(π) = r(π, δ π ) = inf r(π, δ)


δ

is called Bayes risk.

The posterior expectation of the loss function is called the posterior ex-
pected loss, i.e.,

π
ρ(π, d|x) = E (L(θ, d)|x) = L(θ, d)π(θ|x) dθ. (4.8)
Θ
The following theorem gives us a method to find the Bayes decision.

Theorem 4.1 For every x ∈ X , δ π (x) is given by

ρ(π, δ π (x)|x) = inf ρ(π, d|x). (4.9)


d∈D

Proof: The result follows directly from Fubini’s Theorem. Using


p(x|θ)π(θ) = π(θ|x)p(x) we have

r(π, δ) = R(θ, δ)π(θ) dθ
 
Θ

= L(θ, δ(x))p(x|θ) dx π(θ) dθ


Θ X
= L(θ, δ(x))π(θ|x) dθ p(x) dx (4.10)
X Θ
= ρ(π, δ(x))p(x) dx
X
≥ ρ(π, δ π (x))p(x) dx = r(π, δ π ).
X
2
COMMON BAYES DECISION RULES 85
4.3 Common Bayes Decision Rules
In this section we present the Bayes rules for the estimation problem D = Θ
with the quadratic loss, Laplace loss and also briefly discuss alternative loss
function. Exploring the analogy of estimation and prediction we derive Bayes
rules for prediction. Furthermore, we present results for D = {0, 1}.

To begin with the estimation problem, we consider a Bayes model {P, π} with
Θ ⊆ Rp . The goal is to find an “optimal” estimator: θ : x ∈ X → Θ. In other
words, we set D = Θ and search for θ = δ π .

4.3.1 Quadratic Loss


Consider the weighted L2 loss function,

LW (θ, d) = θ − d 2
W −1 = (θ − d)T W (θ − d), (4.11)

where W is a p × p positive definite matrix. For W = Ip we have LW (θ, d) =


L2 (θ, d). For a diagonal weight matrix W = diag(w1 , . . . , wp ) with wi > 0,
θ = (θ1 , . . . , θp ), and d = (d1 , . . . , dp ), it is


p
LW (θ, d) = wi (θi − di )2 .
i=1

Theorem 4.2 For every x ∈ X , δ π (x) with respect to LW is given by the


expectation of the posterior distribution π(θ|x)

δ π (x) = Eπ (θ|x). (4.12)

Proof: The result is a consequence of the projection property of the condi-


tional expectation. To see this, note that
 
ρ(π, d|x) = LW (θ, d)π(θ|x) dθ = (θ − d)T W (θ − d)π(θ|x) dθ
Θ Θ

= (θ ± δ (x) − d) W (θ ± δ π (x) − d)π(θ|x) dθ


π T


= (θ − δ π (x))T W (θ − δ π (x))π(θ|x)dθ (4.13)
Θ

+ 2 (d − δ π (x))T W (δ π (x) − θ)π(θ|x) dθ
 Θ
+ (d − δ π (x))T W (d − δ π (x))π(θ|x) dθ.
Θ
86 DECISION THEORY
π

Since δ (x) = Θ θ π(θ|x) dθ, it holds for the mixture term that

(d − δ π (x))T W (δ π (x) − θ)π(θ|x) dθ
Θ

= (d − δ (x)) W
π T
(δ π (x) − θ)π(θ|x) dθ
 Θ
 
= (d − δ (x)) W δ (x) −
π T π
θ π(θ|x) dθ
Θ
= 0.

Hence

ρ(π, d|x) = (θ − δ π (x))T W (θ − δ π (x))π(θ|x) dθ
Θ
+ (d − δ π (x))T W (d − δ π (x))
≥ ρ(π, δ π (x)).
2
We turn back to the toy data, introduced in Example 2.4.

Example 4.5 (Lion’s appetite)


The appetite of a lion has three different stages: hungry, moderate, lethargic.
We can order the parameters, in the sense that the stages imply different lev-
els of danger. But we have no distances between the parameters. A quadratic
loss is not applicable. 2

In case the posterior belongs to a known distribution class, with the first two
moments finite and known, we obtain the Bayes decision rule and the Bayes
risk without additional problems. We demonstrate it with the help of next
three examples.

Example 4.6 (Binomial distribution)


We consider X|θ ∼ Bin(n, θ). Recall Examples 3.6 and 3.20, and Figure 3.12.
The conjugate prior belongs to the family of beta distributions {Beta(α, β):
α > 0, β > 0}; see (2.7). The expected value μ and variance σ 2 of Beta(α, β)
are
α αβ
μ= , σ2 = 2
. (4.14)
α+β (α + β) (α + β + 1)
The posterior is Beta(α + x, β + n − x). For quadratic loss (θ − d)2 and for an
arbitrary beta prior π, the Bayes estimator is
α+x
δ π (x) = . (4.15)
α+β+n
COMMON BAYES DECISION RULES 87

1.0


MLE

Jeffreys ●

0.8
Laplace ●

subjective ●

● ●

0.6
estimation


0.4


● ●



0.2





0.0

0 1 2 3 4 5 6

Figure 4.4: Example 4.6. The Bayes estimators are related to the priors Beta(0.5, 0.5),
Beta(1, 1), Beta(3.8, 3.8); see also Figure 3.12.

Laplace prior is Beta(1, 1) and Jeffreys is Beta( 12 , 12 ), for which the associated
Bayes estimators are
1+x 0.5 + x
δ πLap (x) = , δ πJeff (x) = .
2+n 1+n
The moment estimator equals the maximum likelihood x
n. See Figure 4.4. 2

As second example we consider the normal distribution with normal priors as


introduced in Example 2.12.

Example 4.7 (Normal i.i.d. sample and normal prior)


We consider an i.i.d. sample X = (X1 , . . . , Xn ) from N(μ, σ 2 ) with known
variance σ 2 . The unknown parameter is θ = μ. For the normal prior N(μ0 , σ02 ),
the posterior, given in (2.8), is

x̄nσ02 + μ0 σ 2 σ02 σ 2
N(μ1 , σ12 ), with μ1 = 2 and σ 2
1 = .
nσ0 + σ 2 nσ02 + σ 2
We obtain as Bayes estimator

x̄nσ02 + μ0 σ 2
δ π (x) = . (4.16)
nσ02 + σ 2

For different prior parameters μ0 and σ02 estimators are plotted as functions
of the sufficient statistic x̄ in Figure 4.5. 2
88 DECISION THEORY

4
MLE
N(1,1)
N(−1,0.5)
2
estimation

0
−2
−4

−4 −2 0 2 4

sample mean

Figure 4.5: Example 4.7. The maximum likelihood estimator and Bayes estimators
related to the priors N(1, 1) and N(−1, 0.5) are plotted as function of the sample
mean, where n = 4 and σ 2 = 1.

The last example is related to Example 2.16. Here we have no closed form
expression for the Bayes estimator and computer-intensive methods are re-
quired.

Example 4.8 (Cauchy i.i.d. sample and normal prior )


We consider an i.i.d. sample X = (X1 , . . . , Xn ) from a Cauchy distribution
C(θ, 1) with location parameter θ and scale parameter γ = 1, with density
given in (2.13). As prior distribution of θ the normal distribution N(μ, σ 2 ) is
used with density ϕ(θ). We get for the posterior


n
1
π(θ|x) ∝ ϕ(θ).
i=1
1 + (xi − θ)2

The Bayes estimator is defined as



θh(θ)ϕ(θ) dθ m1 (x) n
1
δ (x) = Θ
π
= , with h(θ) = .
Θ
h(θ) ϕ(θ) dθ m(x) i=1
1 + (xi − θ)2

Both integrals have to be calculated numerically, for instance by indepen-


dent Monte Carlo. This can be done as following. Generate independent
θ(1) , . . . , θ(N ) from N(μ, σ 2 ) and approximate m(x) by

1 
N
m(x) = h(θ(j) ).
N j=1
COMMON BAYES DECISION RULES 89

20
L1
L2
absolute error
15
loss

10
5
0

−4 −2 0 2 4

theta−d

Figure 4.6: Illustration of different loss functions. The absolute error loss defined in
(4.17) is plotted with k1 = 2, k2 = 4.

For the other integral, generate a new sequence θ(1) , . . . , θ(N ) from N(μ, σ 2 )
and approximate m1 (x) by

1  (j)
N
m1 (x) = θ h(θ(j) ).
N j=1

4.3.2 Absolute Error Loss


In this subsection we consider the L1 loss and its asymmetric version. We
consider the Bayes model {P, π} with Θ ⊆ R and D = Θ, and define the
absolute error loss by

⎨ k (θ − d) if d < θ
2
L(k1 ,k2 ) (θ, d) = , for k1 > 0, k2 > 0. (4.17)
⎩ k (d − θ) if d ≥ θ
1

The absolute error loss is convex. For k1 > k2 overestimation of θ is punished


more severely. For k1 = k2 = 1 it is the L1 loss. The L1 loss increases slower
than the L2 loss, hence it is more important in robust statistics, where the
influence of outliers needs to be reduced. For illustration see Figure 4.6.
90 DECISION THEORY

Theorem 4.3 For every x ∈ X , δ π (x) with respect to L(k1 ,k2 ) is given by
the k1k+k
2
2
fractile of the posterior distribution π(θ|x)

k2
Pπ (θ ≤ δ π (x)|x) = . (4.18)
k1 + k2

Proof: Consider the posterior loss


ρ(π, d|x) = L(k1 ,k2 ) (θ, d)π(θ|x) dθ
Θ
 d  ∞
= k1 (d − θ)π(θ|x) dθ + k2 (θ − d)π(θ|x) dθ.
−∞ d

Denote the distribution function of the posterior distribution by F π (θ|x).


Using integration by parts we get for the first term
 d  d
(d − θ)π(θ|x) dθ =(d − θ)F π (θ|x) |d−∞ + F π (θ|x) dθ
−∞ −∞
 d
= F π (θ|x) dθ.
−∞

d
Using dx (−(1 − F (x)) = f (x) and integration by parts, we obtain for the
second term
 ∞  ∞
(θ − d)π(θ|x) dθ =(θ − d)(−(1 − F π (θ|x)) |∞
d + (1 − F π (θ|x)) dθ
d
 ∞ d

= (1 − F (θ|x)) dθ.
π
d

Hence
 d  ∞
ρ(π, d|x) = k1 π
F (θ|x) dθ + k2 (1 − F π (θ|x)) dθ.
−∞ d

The optimal decision should fulfill (4.9), which implies that the derivative of
the posterior loss with respect to d should be zero,

ρ (π, d|x) = 0.

We have
ρ (π, d|x) = k1 F π (d|x) − k2 (1 − F π (d|x)) = 0
and obtain
k2
F π (δ π (x)|x) = .
k1 + k2
2
COMMON BAYES DECISION RULES 91
Note that, for the L1 loss, we have k1 = k2 and the Bayes estimate is the
median of the posterior distribution. Let us consider an example.

Example 4.9 (Binomial distribution)


Continuation of Example 4.6. We consider X|θ ∼ Bin(n, θ) and θ ∼ Beta(α, β).
The posterior is Beta(α + x, β + n − x). For the L1 loss, L(θ, d) = |θ − d|, the
Bayes estimator is the median of Beta(α + x, β + n − x). Unfortunately the
median of the beta distribution Beta(α, β) is the inverse of an incomplete beta
function. We use the following approximation for the median of Beta(α, β)

α − 13
med ≈ , α > 1, β > 1. (4.19)
α + β − 23

Hence
α + x − 13
δ π (x) ≈ . (4.20)
α + β + n − 23
Laplace prior is Beta(1, 1) and Jeffreys is Beta( 12 , 12 ), and the related Bayes
estimators are, for x > 0, x ≤ 1

x + 23 1
+x
δ πLap (x) ≈ 4 , δ πJeff (x) ≈ 6
1 .
3 + n 3 +n
2

To continue with the prediction problem, we now consider a Bayes model


{P, π} with Θ ⊆ Rp . The goal is to predict, in an optimal way, a future data
point xf ∈ Xf generated by a distribution Qθ with the same parameter θ as
the data generating distribution Pθ . We set D = Xf .

4.3.3 Prediction
We assume the Bayes model {P, π} which has the posterior π(θ|x). The future
data point xf is generated by a distribution Qθ , possibly depending on x;
setting z = xf , we have q(z|θ, x). The prediction error Lpred (z, d) is the
loss we incur by predicting z = xf by some d ∈ D = Xf . Note that, the future
observation is a realization of a random variable. The prediction error is not
a loss function as given in (4.1). We define the loss function as the expected
prediction error 
L(θ, d) = Lpred (z, d)q(z|θ, x) dz. (4.21)
Xf

Then the posterior expected loss, defined in (4.8) as



ρ(π, d|x) = L(θ, d)π(θ|x) dθ, (4.22)
Θ
92 DECISION THEORY
is given by
 
ρ(π, d|x) = Lpred (z, d)q(z|θ, x) dz π(θ|x) dθ
Θ Xf
  (4.23)
= Lpred (z, d) q(z|θ, x)π(θ|x) dθ dz.
Xf Θ

We define the predictive distribution as the conditional distribution of the


future data point xf given the data x;

π(xf |x) = q(xf |θ, x)π(θ|x) dθ. (4.24)
Θ

Hence 
ρ(π, d|x) = Lpred (z, d)π(z|x) dz. (4.25)
Xf

Applying Theorem 4.1 we obtain the Bayes rule by minimizing the expected
posterior loss for each x ∈ X . We call this Bayes rule Bayes predictor.
ρ(π, δ π (x)|x) = inf ρ(π, d|x)
d∈D
 (4.26)
= inf Lpred (z, d)π(z|x) dz.
d∈D Xf

It is the same minimizing problem as for determining the Bayes estimator,


where the predictive distribution takes over the role of the posterior. Apply-
ing above results on Bayes estimators we obtain following theorem.

Theorem 4.4
1. For Lpred (z, d) = (z − d)T W (z − d), W positive definite a the Bayes
predictor is the expectation of the predictive distribution, E(z|x).
2. Set Xf = R and Lpred (z, d) = |z − d|. The Bayes predictor is the median
of the predictive distribution, Median(z|x).

We illustrate the prediction by the following example.

Example 4.10 (Bayes predictor)


We consider an i.i.d. sample X = (X1 , . . . , Xn ) from N(μ, σ 2 ) with known
variance σ 2 . The unknown parameter is θ = μ. For the normal prior N(μ0 , σ02 ),
the posterior, given in (2.8), is
x̄nσ02 + μ0 σ 2 σ02 σ 2
N(μ1 , σ12 ), with μ1 = and σ 2
1 = .
nσ02 + σ 2 nσ02 + σ 2
We want to predict
xf = x̄ + aθ + ε, ε ∼ N(0, σf2 ),
COMMON BAYES DECISION RULES 93
where σf2 and a are known and ε is independent of X and θ. Thus we have

xf |(θ, x) ∼ N(x̄ + aθ, σf2 ) and θ|x ∼ N(μ1 , σ12 ).

Applying known results for normal distribution, i.e., Theorem 6.1, we obtain
the predictive distribution

xf |x ∼ N(x̄ + aμ1 , a2 σ12 + σf2 ).

This distribution is symmetric about the expectation. In both cases of Theo-


rem 4.4 we obtain the same Bayes predictor

xf = x̄ + aμ1 .

It is just the prediction rule obtained by plugging in the estimator. For the
Bayes predictor we plug in the Bayes estimator. 2

Now we consider the testing problem.

4.3.4 The 0–1 Loss


We assume the Bayes model {P, π}. We are interested in a testing problem
with respect to the statistical model which is split into two disjunct submodels

P = {Pθ : θ ∈ Θ ⊆ Rp } = {Pθ : θ ∈ Θ0 } ∪ {Pθ : θ ∈ Θ1 }, (4.27)


where
Θ = Θ0 ∪ Θ1 , Θ0 ∩ Θ1 = ∅.
We set Pj = {Pθ : θ ∈ Θj } for j = 0, 1. The decision space has only two
elements D = {0, 1}, where d = j stands for the decision that the model Pj is
the true data generating distribution; see also the explanation in Chapter 2.
We define the 0–1 loss by


⎪ 0 if d = 0 θ ∈ Θ0

⎪ ⎧


⎨ 0 if d = 1 θ ∈ Θ ⎨ d if θ ∈ Θ0
1
L(0−1) (θ, d) = = . (4.28)

⎪ 1 if d = 0 θ ∈ Θ1 ⎩ 1 − d if θ ∈ Θ1




⎩ 1 if d = 1 θ ∈ Θ
0

The frequentist risk is



 ⎨ P (δ(x) = 1)
θ if θ ∈ Θ0
R(θ, δ) = L(0−1) (θ, δ(x))p(x|θ) dx = .
X ⎩ P (δ(x) = 0) if θ ∈ Θ1
θ

Recall from the theory of hypotheses testing, H0 : P0 versus H1 : P1 (see Liero


and Zwanzig, 2011, Chapter 5) the decision rule is a test. The error of first
94 DECISION THEORY
type occurs when we reject H0 , but the sample comes from P0 . The error of
second type occurs when we do not reject H0 , but the sample comes from P1 .
The frequentist risk is

⎨ P(Error of Type I) if θ ∈ Θ
0
R(θ, δ) = .
⎩ P(Error of Type II) if θ ∈ Θ
1

Theorem 4.5 For every x ∈ X , δ π (x) with respect to L(0−1) is given by



⎨ 1 if Pπ (Θ |x) < Pπ (Θ |x)
0 1
δ π (x) = (4.29)
⎩ 0 if Pπ (Θ |x) ≥ Pπ (Θ |x)
0 1

Proof: Consider the posterior loss. We have



ρ(π, d|x) = L(0−1) (θ, d)π(θ|x) dθ
Θ 
= d π(θ|x) dθ + (1 − d) π(θ|x) dθ.
Θ0 Θ1
= d Pπ (Θ0 |x) + (1 − d) Pπ (Θ1 |x)
= d (Pπ (Θ0 |x) − Pπ (Θ1 |x)) + Pπ (Θ1 |x)
≥ Pπ (Θ1 |x) + δ π (x)(Pπ (Θ0 |x) − Pπ (Θ1 |x))
= ρ(π, δ π (x)|x).
The inequality holds, because δ π (x) = 0 iff Pπ (Θ0 |x) − Pπ (Θ1 |x) is non–
negative.
2
This optimal rule is very intuitive. The posterior π(θ|x) is the weight function
on the parameter set Θ after the experiment with result x. We decide for the
subset with highest weight, see Figure 4.7.
It works well in all cases where data x exists such that 0 < Pπ (Θ0 |x) < 1. But
for simple hypothesis, Θ0 = {θ0 }, and for continuous posterior distributions
it holds that Pπ (Θ0 |x) = 0 for all x ∈ X . Then the Bayes rule is δ π (x) = 1
independent of the result of the experiment. This renders it as an unpracticable
method. In Chapter 8 we consider Bayes tests in more detail and offer Bayes
solutions for testing simple hypotheses.

Example 4.11 (Normal i.i.d. sample and inverse-gamma prior)


We consider Example 2.14. We have an i.i.d sample from N(0, θ). The un-
known parameter is the variance θ = σ 2 . We consider as prior distribution of
COMMON BAYES DECISION RULES 95

0.7
0.7

0.6
null hypothesis
0.6

null hypothesis
alternative alternative

0.5
0.5

0.4
0.4

posterior
posterior

0.3
0.3
0.2

0.2
0.1

0.1
0.0

0.0
● ●

0 1 2 3 4 5 0 1 2 3 4 5

variance variance

Figure 4.7: Example 4.11. The posterior π(θ|x) is plotted, with prior InvGamma(6, 9).
Left: Θ0 = (0, 1] and Θ1 = (1, ∞). The posterior probability of Θ1 is higher. The
Bayes decision is δ π (x) = 1. Right: Θ0 = {1} and Θ1 = (0, 1) ∪ (1, ∞). The posterior
probability of Θ0 is zero. The Bayes decision is δ π (x) = 1.

θ an inverse-gamma distribution InvGamma(α, β), defined in (2.11). The pos-


terior distribution is the inverse-gamma
n distribution with shape parameter
α + n2 and scale parameter 12 i=1 x2i + β. We consider the testing problem
H0 : σ 2 ≤ 1 versus H1 : σ 2 > 1. The situation changes for the testing problem
H0 : σ 2 = 1 versus H1 : σ 2 = 1, since in this case we will always reject H0 .
Figure 4.7 illustrates both situations. 2

4.3.5 Intrinsic Losses


Here we briefly discuss two more loss functions, which measure the distance
between distributions. They are of interest especially for model choice, when
we want to find a distribution which delivers a good fit to the data, but we are
less interested in the parameters themselves. We can use the Kullback–Leibler
divergence
  
p(x|θ)
LKL (θ, d) = K(Pθ |Pd ) = p(x|θ) ln dx (4.30)
X p(x|d)

and the Hellinger distance

 # 2
1 p(x|d)
2
LH (θ, d) = H (Pθ , Pd ) = −1 p(x|θ) dx. (4.31)
2 X p(x|θ)
96 DECISION THEORY
Note that the loss LKL (θ, d) is asymmetric, but LH (θ, d) is symmetric. We can
reformulate the Hellinger distance as
 # 2
1 p(x|d)
LH (θ, d) = − 1 p(x|θ) dx
2 X p(x|θ)
 #
1 p(x|d) p(x|d)
= −2 + 1 p(x|θ) dx
2 X p(x|θ) p(x|θ) (4.32)
   
1
= p(x|d) dx − 2 p(x|θ)p(x|d) dx + 1
2
X  X

=1− p(x|θ)p(x|d) dx = 1 − H 12 (Pθ , Pd ).


X

where H 12 (Pθ , Pd ) is the Hellinger transform, generally defined as


 
H 12 (P, Q) = p(x)q(x) dx. (4.33)
X

We compare the losses under normal distribution.

Example 4.12 (Normal distribution)


Consider the normal distribution N(θ, 1) with density
1 1
ϕθ (x) = √ exp(− (x − θ)2 ).
2π 2
The likelihood ratio is
ϕθ (x) 1 1
= exp(− (x − θ)2 + (x − d)2 )
ϕd (x) 2 2
 
1
= exp (x − θ)(θ − d) + (θ − d) 2
2
and
  
ϕθ (x)
LKL (θ, d) = Eθ ln
ϕd (x)
 
1
= Eθ (x − θ)(θ − d) + (θ − d)2 (4.34)
2
1
= (θ − d)2 .
2
Under normal distribution, Kullback–Leibler loss and the quadratic loss are
equivalent. Consider the Hellinger loss
 
LH (θ, d) = 1 − p(x|θ)p(x|d) dx.
X
THE MINIMAX CRITERION 97
We have
    
1 1
p(x|θ)p(x|d) dx = √ exp − ((x − θ) + (x − d) ) dx
2 2
X 2π X 4

1 1 θ+d 2 1
=√ exp(− (x − ) − (θ − d)2 ) dx
2π X 2 2 8
1
= exp(− (θ − d)2 ).
8
(4.35)
Hence
1
LH (θ, d) = 1 − exp(− (θ − d)2 ).
8
The Hellinger loss is a monotone transformation of the quadratic loss. 2

4.4 The Minimax Criterion


In order to show optimality results of a Bayes decision rule in a frequentist
context, we need to introduce the minimax criterion.
The first step is to expand the set of possible decisions. We define a random-
ized decision rule δ ∗ (x, .) as a distribution over the decision space D. A
non-randomized decision rule can be considered as a dirac distribution (one
point distribution), δ ∗ (x, a) = 1, for x = a. Thus we are now searching the
“optimal” decision in a bigger class. The loss function of a randomized decision
rule is defined as the expected loss


L(θ, δ (x, .)) = L(θ, a)δ ∗ (x, a) da. (4.36)
D

The risk function, as before, is given by



R(θ, δ ∗ ) = Eθ L(θ, δ ∗ (x, .)) = L(θ, δ ∗ (x, .))p(x|θ) dx. (4.37)
X

A famous example in this context is the Neyman–Pearson test, see for example
Liero and Zwanzig (2011, Section 5.3.1).

Example 4.13 (Randomized test) Let P = P0 ∪ P1 . We are interested in


the testing problem: H0 : P0 versus H1 : P1 . The decision space is D = {0, 1}.
A randomized decision rule is a distribution over D = {0, 1}, so that
δ ∗ (x, .) = Ber(ϕ(x)).
In this case the loss is
L(θ, δ ∗ (x, .)) = L(θ, 1)ϕ(x) + L(θ, 0)(1 − ϕ(x)).
For the risk we obtain
R(θ, δ ∗ ) = L(θ, 1)Eθ ϕ(x) + L(θ, 0)(1 − Eθ ϕ(x)).
2
98 DECISION THEORY

Denote the class of randomized decisions by D . We apply the minimax prin-
ciple in this larger class D∗ , and search for the best decision in a bad situation.

Definition 4.3 (Minimax) The minimax risk R associated with a loss


function L is the value

R = inf ∗ sup R(θ, δ) = inf ∗ sup Eθ L(θ, δ(x, .)). (4.38)


δ∈D θ∈Θ δ∈D θ∈Θ

The minimax decision rule is any rule δ0 ∈ D∗ such that

sup R(θ, δ0 ) = R. (4.39)


θ∈Θ

Example 4.14 (Lion’s appetite)


Continuing with Examples 4.1 and 4.2, a randomized decision is a three point
distribution over D = {d1 , d2 , d3 }

d1 d2 d3
δ ∗ (x, .) = , p3 (x) = p1 (x) + p2 (x).
p1 (x) p2 (x) p3 (x)
The non–randomized decision rule δ1 in Example 4.2 corresponds to

d1 d2 d3 d1 d2 d3
δ1∗ (1, .) = . . . δ1∗ (4, .) =
0 0 1 1 0 0
The loss of a randomized decision rule is calculated as
L(θ, δ ∗ (x, .) = p1 (x)L(θ, d1 ) + p2 (x)L(θ, d2 ) + p3 (x)L(θ, d3 ),
where L(θ, d) is defined in Example 4.1. We obtain the risk
R(θ, δ ∗ ) = Eθ p1 (x)L(θ, d1 ) + Eθ p2 (x)L(θ, d2 ) + Eθ p3 (x)L(θ, d3 ).
A randomized decision rule δ ∗ (x, .) is defined by

x 0 1 2 3 4
p1 (x) 0 0.2 0.5 0.8 0.9
p2 (x) 0.2 0.3 0.4 0.2 0.1
p3 (x) 0.8 0.5 0.1 0 0

Using the model P given in Example 2.4, we obtain for R(θ, δ ∗ ),

θ1 θ2 θ3
.
0.194 0.263 0.032
THE MINIMAX CRITERION 99

0.8

0.6
risk

0.4

● ●

0.2

● ●


0.0

● ●

1.0 1.5 2.0 2.5 3.0

theta

Figure 4.8: Example 4.14. We compare only four rules. δ1 , δ2 , δ3 are the non-
randomized rules in Example 4.2. The randomized rule δ ∗ given in Example 4.14 is
plotted as line. The comparison criterion is the maximal risk highlighted by thick
points. The decision minimizing the maximal risk is δ2 .

In Figure 4.8 the risk functions are for δ1 , δ2 , δ3 given in Example 4.2 and for
δ ∗ . Comparing these four rules, the minimax rule is δ2 . 2

Next we give two statements, which provide an insight into the relationship
between the minimax concept and admissibility. As shown in Figure 4.8 the
minimax rule is also admissible. Theorem 4.6 gives the corresponding result.

Theorem 4.6
If there exists a unique minimax rule, then it is admissible.

Proof: Let δ ∗ be an unique minimax rule. We carry out the proof by con-
tradiction. So let δ ∗ be not admissible. Then there exists a rule δ1 such that

R(θ, δ1 ) ≤ R(θ, δ ∗ ), for all θ ∈ Θ

and there exists a parameter θ0 ∈ Θ such that

R(θ0 , δ1 ) < R(θ0 , δ ∗ ).

This implies

sup R(θ, δ1 ) ≤ sup R(θ, δ ∗ ) = inf ∗ sup R(θ, δ)


θ∈Θ θ∈Θ δ∈D θ∈Θ

and that δ1 is also minimax, which is a contradiction to the uniqueness of the


minimax rule δ ∗ .
2
100 DECISION THEORY
The following result is on the opposite direction of the relation between
admissibility and minimaxity.

Theorem 4.7
Let δ0 be an admissible rule. If for all θ ∈ Θ the loss function L(θ, .) is
strictly convex and if for all θ ∈ Θ

R(θ, δ0 ) = const,

then δ0 is unique minimax.

Proof: Let δ0 be an admissible rule with constant risk, then for an arbitrary
θ0 ∈ Θ
sup R(θ, δ0 ) = R(θ0 , δ0 )
θ∈Θ

and there exists no δ with


 < R(θ1 , δ0 ) = R(θ0 , δ0 ), for some θ1 ∈ Θ.
R(θ1 , δ)

We carry out the proof by contradiction and state that δ0 is not unique min-
imax. Then there exists δ1 such that

sup R(θ, δ1 ) = sup R(θ, δ0 ) = R(θ0 , δ0 ).


θ∈Θ θ∈Θ

Thus
R(θ0 , δ1 ) ≤ sup R(θ, δ1 ) = R(θ0 , δ0 ). (4.40)
θ∈Θ

Let us consider the case

R(θ0 , δ1 ) = R(θ0 , δ0 ). (4.41)

We define a new decision rule


δ1 (x) + δ0 (x)
δ2 (x) = .
2
Because of the strong convexity of the loss function we obtain

0 ≤ R(θ0 , δ2 ) = L(θ0 , δ2 (x))p(x|θ0 ) dx
X  
1 1
< L(θ0 , δ1 (x)) + L(θ0 , δ0 (x)) p(x|θ0 ) dx
X 2 2
1 1
= R(θ0 , δ1 ) + R(θ0 , δ0 )
2 2
= R(θ0 , δ0 ).
Thus
R(θ0 , δ2 ) < R(θ0 , δ0 ),
BRIDGES 101
which is a contradiction to the admissibility of δ0 . We can exclude the case
(4.41) and the inequality in (4.40) is strong, which is a contradiction to the
admissibility of δ0 .
2

4.5 Bridges
In this section we show optimality results for Bayesian procedures in the fre-
quentist context. In a sense it is a bridge between the statistician who prefers
the Bayes model and the one who does not like the additional assumption of
a prior distribution. But it is not only some kind of justification of the Bayes
model, rather the bridge results deliver a method to prove optimality results
for a “frequentist” method when it can be rewritten as a Bayes rule.
We start with the statement that Bayes rules are not randomized.

Theorem 4.8 Assume a Bayes model {P, π}. Then it holds that

inf r(π, δ) = ∗inf ∗ r(π, δ ∗ ) = r(π).


δ∈D δ ∈D

Proof: Consider the posterior risk. For all x ∈ X and all δ ∗ ∈ D∗ we have
 
ρ(π, δ ∗ (x, .)|x) = L(θ, a)δ ∗ (x, a) da π(θ|x) dθ
Θ D
= L(θ, a)π(θ|x) dθ δ ∗ (x, a) da
D Θ

= ρ(π, a|x)δ ∗ (x, a) da
D

≥ inf ρ(π, a|x) δ ∗ (x, a) da = inf ρ(π, a|x).
a∈D D a∈D

Because r(π, δ ∗ ) = X
ρ(π, δ ∗ (x, .)|x)p(x)dx, we further have

r(π, δ ∗ ) ≥ inf r(π, δ)


δ∈D

Otherwise D∗ is the larger class, so that

inf r(π, δ) ≥ ∗inf ∗ r(π, δ ∗ ).


δ∈D δ ∈D

2
Here we have a bridge statement, which says that a Bayes rule can be admis-
sible. The Bayes rule δ π is optimal within the framework of the Bayes decision
theory, whereas admissibility is an optimality property which does not require
a Bayes model.
102 DECISION THEORY

Theorem 4.9 Assume a Bayes model {P, π}. If

π(θ) > 0, for all θ ∈ Θ and r(π) = inf r(π, δ) < ∞


δ∈D

and if
R(θ, δ) is continuous in θ for all δ,
then δ π is admissible.

Proof: The proof is by contradiction. Suppose that δ π is inadmissible. Then


there exists a rule δ1 such that

R(θ, δ1 ) ≤ R(θ, δ π ), for all θ ∈ Θ

and
R(θ1 , δ1 ) < R(θ1 , δ π ), for some θ1 ∈ Θ.
Because R(., δ) is continuous there exists a neighborhood C, θ1 ∈ C and
π(C) > 0 such that

R(θ, δ1 ) < R(θ, δ π ) for all θ ∈ C ⊂ Θ

and  
R(θ, δ1 )π(θ) dθ < R(θ, δ π )π(θ) dθ
C C

and for C = Θ \ C
 
R(θ, δ1 )π(θ) dθ ≤ R(θ, δ π )π(θ) dθ.
C C

Hence
 
r(π, δ1 ) = R(θ, δ1 )π(θ) dθ + R(θ, δ1 )π(θ) dθ
C C
< R(θ, δ π )π(θ) dθ + R(θ, δ π )π(θ) dθ
C C
= r(π, δ π ) = inf r(π, δ) < ∞.
δ∈D

This is a contradiction to the statement that δ π is a Bayes rule.


2
Let us apply this result to the binomial model.

Example 4.15 (Binomial distribution)


We consider X|θ ∼ Bin(n, θ) and θ ∼ Beta(α, β). We are interested in Bayes
estimation of θ under quadratic loss and apply the results of Example 4.9.
The Bayes estimate is
α+x
δ π (x) = . (4.42)
α+β+n
BRIDGES 103
It follows that
α + Eθ X α + nθ
Eθ δ π (X) = = (4.43)
α+β+n α+β+n
and
Varθ X nθ(1 − θ)
Varθ δ π (X) = = . (4.44)
(α + β + n)2 (α + β + n)2
We calculate the risk

n
R(θ, δ π ) = Pθ (k)(θ − δ π (k))2
k=0
=Varθ δ π (X) + (θ − Eθ δ π (X))2
1  2 
= 2
α + θ(n − 2α2 − 2αβ) + θ2 ((α + β)2 − n) .
(α + β + n)
(4.45)

The risk R(θ, δ π ) is a continuous function of θ. Further, the beta distribution


is positive over (0, 1). Thus we can apply Theorem 4.9. The Bayes estimator
1+n . 2
1+x
in (4.42) is admissible. Same holds for δ πLap (x) = 2+n and δ πJeff (x) = 0.5+x

In the following theorem, instead of continuity we require that the loss is


strictly convex and that the Bayes risk is bounded. Then we can go again over
the bridge from the Bayes side to the frequentist side.

Theorem 4.10 Assume a Bayes model {P, π}, where π can be improper.
If for all θ ∈ Θ the loss function L(θ, .) is strictly convex, and if δ π is a
Bayes rule with finite Bayes risk

r(π) = inf r(π, δ) < ∞,


δ∈D

then δ π is admissible.

Proof: Let δ π be inadmissible. There exists a rule δ1 such that

R(θ, δ1 ) ≤ R(θ, δ π ), for all θ ∈ Θ

and
R(θ1 , δ1 ) < R(θ1 , δ π ), for some θ1 ∈ Θ.
We define a new decision rule
δ1 (x) + δ π (x)
δ2 (x) = .
2
104 DECISION THEORY
Because of the strong convexity of the loss function we obtain

R(θ, δ2 ) = L(θ, δ2 (x))p(x|θ) dx
X  
1 1 π
< L(θ, δ1 (x)) + L(θ, δ (x)) p(x|θ) dx
X 2 2
1 1
= R(θ, δ1 ) + R(θ, δ π )
2 2
≤ R(θ, δ π ).
Thus
R(θ, δ2 ) < R(θ, δ π ) for all θ ∈ Θ.

For r(π, δ) = R(θ, δ) π(θ) dθ we obtain
r(π, δ2 ) < r(π, δ π ) = inf r(π, δ) < ∞,
δ∈D

which is a contradiction!
2
We apply this theorem in the following example on testing hypotheses.

Example 4.16 (P-value)


Let X ∼ N(μ, 1) and the Laplace prior π(μ) ∝ const. Then μ|x ∼ N(x, 1). We
consider the testing problem: H0 : μ ≤ 0 versus H1 : μ > 0. The parameter of
interest is defined as ⎧
⎨ 1 for μ ≤ 0
θ= .
⎩ 0 for μ > 0

Consider a quadratic loss. The Bayes estimate of θ is p(x) = E(θ|x), where


p(x) = Pπ (θ = 1|x) = Pπ (μ ≤ 0|x) = Pπ (μ − x ≤ −x|x) = 1 − Φ(x),
where Φ is the distribution function of N(0, 1). Applying Theorem 4.10 we
note that p(x) is admissible. Further, p(x) is the p-value of the test problem
above. 2
The p-value example is a good example of using the Bayesian interpretation
of a well-known frequentist estimate for showing its admissibility. Now we
discuss an example, where it is not possible to apply Theorem 4.10.

Example 4.17 (Normal distribution and Laplace prior)


We suppose X ∼ N(θ, 1) and the Laplace prior π(θ) ∝ const. Then θ|x ∼
N(x, 1). Theorem 4.2 states that the Bayes estimate with respect to quadratic
loss is δ π = E(θ|x) = x. We have R(θ, δ π ) = Var(X) = 1. The Bayes risk is
 ∞  ∞
r(π) = π
R(θ, δ ) dθ = 1 dθ = ∞,
−∞ −∞

so that Theorem 4.10 is not applicable. 2


BRIDGES 105
Here is one more bridge, to obtain admissibility of the Bayes rule.

Theorem 4.11 Assume a Bayes model {P, π}. If the Bayes rule δ π is
π
unique, then δ is admissible.

Proof: Suppose that δ π is inadmissible. Then there exists a rule δ1 such


that
R(θ, δ1 ) ≤ R(θ, δ π ), for all θ ∈ Θ
and
R(θ1 , δ1 ) < R(θ1 , δ π ), for some θ1 ∈ Θ.
This implies

r(π, δ1 ) ≤ R(θ, δ1 )π(θ) dθ

≤ R(θ, δ π )π(θ) dθ
Θ
= r(π),
i.e., δ1 is also a Bayes rule, which is a contradiction to the uniqueness of δ π .
2
In the following, we show a bridge between minimaxity and Bayes optimality.

Theorem 4.12
Assume
inf sup r(π, δ) = sup inf r(π, δ). (4.46)
δ∈D π π δ∈D

Then a minimax estimator δ0 is a Bayes estimator associated with π0 , where

sup r(π) = r(π0 ). (4.47)


π

The prior π0 in (4.47) is called least favourable prior. The statement above
means for a minimax estimator δ0 that
r(π0 , δ) ≤ r(π0 , δ0 ) = r(π0 ) for all δ ∈ D. (4.48)
Proof: We begin by showing the following general result for f (x) ≥ 0

sup f (x) =  sup f (x)π(x) dx. (4.49)
x {π: π(x)dx=1,π(x)≥0}

We have, for all π, π(x)dx = 1 and f (x) ≥ 0, so that
 
f (x)π(x) dx ≤ sup f (x) π(x) dx = sup f (x).
x x
106 DECISION THEORY
Take a sequence {xn } with limn→∞ f (xn ) = supx f (x) and define a sequence
of Dirac measures πn , πn (x) = 1 for x = xn , otherwise πn (x) = 0. Then
 
sup f (x)π(x) dx ≥ f (x)πn (x)dx = f (xn )
π

Taking the limit, we obtain the inequality in the other direction



sup f (x)π(x) dx ≥ lim f (xn ) = sup f (x)
π n→∞ x

and (4.49) follows. Now we prove (4.48). It suffices to show that r(π0 , δ0 ) ≤
r(π0 ). We have

r(π0 , δ0 ) = R(θ, δ0 )π0 (θ) dθ (definition of integrated risk)

≤ sup R(θ, δ0 ) (because of (4.49))


θ
= inf ∗ sup R(θ, δ) (δ0 is minimax)
δ∈D θ

= inf ∗ sup R(θ, δ)π(θ) dθ (because of (4.49))
δ∈D π

≤ inf sup R(θ, δ)π(θ) dθ (because of D ⊂ D∗ )
δ∈D π

= inf sup r(π, δ) (definition of integrated risk)


δ∈D π

= sup inf r(π, δ) (assumption (4.46))


π δ∈D
= sup r(π).
π
2
We use the toy example of lion’s appetite to illustrate the notion of a least
favourable prior.

Example 4.18 (Lion’s appetite)


Recall Example 4.1. We consider only two stages of lions’s appetite θ0 , θ1 ,
where θ0 means not hungry and θ1 means hungry. We set a new and simplified
model, instead of that in Example 2.4, by probabilities Pθ (x) given in the
following table:
x 0 1 2 3 4
θ0 0.4 0.4 0.1 0.1 0 (4.50)
θ1 0 0.05 0.05 0.8 0.1
The prior is a two point distribution over {θ0 , θ1 }. The prior probability, that
the lion was hungry, is p. For x ∈ {0, 1, 2, 3, 4}, the posterior is calculated as
k0 (x)
π(θ0 |x) = , π(θ1 |x) = 1 − π(θ0 |x)
k0 (x) + k1 (x)
BRIDGES 107
and
k0 (x) = (1 − p)Pθ0 (x), k1 (x) = pPθ1 (x).
We suppose an asymmetric loss, because it would be much more dangerous
to classify a hungry lion as no hungry as compared to the other way around.

L(θ, d) d = θ0 d = θ1
θ0 0 0.2 (4.51)
θ1 0.8 0

The posterior loss is defined by

ρ(π, d|x) = L(θ0 , d)π(θ0 |x) + L(θ1 , d)π(θ1 |x).

We obtain

ρ(π, θ0 |x) = 0.2 π(θ1 |x) and ρ(π, θ1 |x) = 0.8 π(θ0 |x),

and the Bayes decision is given by


⎧ ⎧
⎨ θ if 0.2 π(θ |x) < 0.8π(θ |x) ⎨ θ if p < g(x)
0 1 0 0
δ π (x) = = ,
⎩ θ otherwise ⎩ θ otherwise
1 1
(4.52)
where the threshold is
0.2 Pθ0 (x)
g(x) = .
0.8 Pθ1 (x) + 0.2 Pθ0 (x)

Using (4.50) and (4.51) we obtain the following rounded numbers

x 0 1 2 3 4
(4.53)
g(x) 1 0.67 0.32 0.2 0

This implies that we have five different Bayes rules δj , j = 1, . . . , 5. Set 1 for
decision θ1 and 0 for decision θ0 . Then

x 0 1 2 3 4
p = 0 δ1 (x) 0 0 0 0 0
0 < p ≤ 0.2 δ2 (x) 0 0 0 0 1
(4.54)
0.2 < p ≤ 0.32 δ3 (x) 0 0 0 1 1
0.32 < p ≤ 0.67 δ4 (x) 0 0 1 1 1
0.67 < p ≤ 1 δ5 (x) 0 1 1 1 1
108 DECISION THEORY

0.04
0.03
Bayes risk

0.02
0.01
0.00

0.0 0.2 0.4 0.6 0.8 1.0

prior probability of hungry stage

Figure 4.9: Example 4.18. The least favourable prior distributions have p between
0.32 and 0.67.

Using the loss in (4.51) we calculate


5
R(θ, δj ) = L(θ, δj (xi ))Pθ (xi )
i=1

and

r(p) = R(θ1 , δj )p + R(θ0 , δj )(1 − p), for p ∈ [g(xj ), g(xj−1 )] j = 1, . . . , 5.

The Bayes risk is piecewise linear. The supremum is attained for all p ∈
[0.32, 0.67]. The Bayes risk is plotted in Figure 4.9. For each given prior dis-
tribution the Bayes rule is unique. By Theorem 4.11, the Bayes rules in (4.54)
are admissible. 2

We now cross the bridge in the other direction. The next theorem gives a
sufficient condition for a Bayes rule to be minimax. Note that, the condition
is only sufficient!

Theorem 4.13 Assume a Bayes model {P, π}. If the Bayes rule δ π has
constant risk
R(θ, δ π ) = const, (4.55)
then δ π is minimax.
BRIDGES 109
Proof: Suppose that δ is not minimax. Then there exists a decision rule
π

δ1 such that

sup R(θ, δ1 ) < sup R(θ, δ π )


θ θ
= R(θ, δ π ) = const

= R(θ, δ π )π(θ) dθ
Θ
= r(π, δ π ).

Otherwise 
sup R(θ, δ1 ) ≥ R(θ, δ1 )π(θ) dθ = r(π, δ1 ).
θ Θ

This implies
r(π, δ1 ) < r(π, δ π )
which is a contradiction.
2

Example 4.19 (Lion’s appetite)


We continue with Example 4.18. Since the risk is constant for δ4 , we have

R(θ1 , δ4 ) = R(θ0 , δ4 ) = 0.04

Further r(π0 ) = 0.04 = supp r(p), see Figure 4.9. Hence the rule δ4 is our
favorite,
x 0 1 2 3 4
(4.56)
δ4 (x) 0 0 1 1 1
It is admissible and minimax and associated with a least favourable prior. We
are on the safe side and decide that if the lion has eaten more than 1 person
the lion was hungry; no surprise! 2

Let us now consider a more formal example.

Example 4.20 (Binomial distribution)


We continue with Example 4.15, where X|θ ∼ Bin(n, θ), and we consider
only the family of conjugate beta priors, θ ∼ Beta(α, β). Then the posterior
is also a beta distribution. Under quadratic loss, the Bayes estimator is the
expectation of the posterior distribution, given in (4.42). Further the risk
function is calculated in (4.45) as
1  2 
R(θ, δ π ) = 2
α + θ(n − 2α2 − 2αβ) + θ2 ((α + β)2 − n) .
(α + β + n)
110 DECISION THEORY
3.0

1.0

Laplace Laplace ●
2.5

Jeffreys ●

0.8
Jeffreys ●
Least favourable minimax
MLE
2.0

Bayes estimation

0.6

prior

1.5

0.4


1.0

0.2

0.5



0.0

0.0

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0

theta x

Figure 4.10: Example 4.20. X ∼ Bin(3, θ). Left: Prior √


distributions,

Laplace:
Beta(1, 1), Jeffreys: Beta(0.5, 0.5), least favourable: Beta( 23 , 23 ) are plotted. Right:
The Bayes estimators associated with the priors are plotted. For comparison also
the maximum likelihood estimator is included. The Bayes estimators overestimate
small values and underestimate large values of θ.


n
Plugging α = β = 2 in (4.45) we get

2n 2n
n − 2α2 − 2αβ = n − − =0
4 4
and √
(α + β)2 − n = ( n)2 − n = 0
so that the risk is constant. The

least favourable prior belongs to the conjugate

family; it is the prior Beta( 2n , 2n ). Applying Theorem 4.13 the estimator

n
+x
δ(x) = 2
√ (4.57)
n+ n

is minimax. In Figure 4.10 different priors and Bayes estimates are plotted
and compared. 2

4.6 List of Problems


1. Consider a query for analyzing customer satisfaction. The parameter has
two levels θ ∈ {0, 1} where 1 means the customer is satisfied. The cos-
tumer can choose between bad, acceptable, and good, coded as −1, 0, 1,
respectively. The goal is to estimate θ under 0–1 loss. The probabilities
LIST OF PROBLEMS 111
Pθ (x) = P(X = x|θ) are given in the following table:

x −1 0 1
θ=0 0.5 0.3 0.2
θ=1 0 0.5 0.5

Assume that the prior probability of θ = 1 is p.


(a) Determine the posterior distribution.
(b) Determine the Bayes estimator of θ.
(c) Calculate the frequentist risk.
(d) Calculate the Bayes risk.
(e) Determine the least favorable prior.
2. Consider the binomial model:

X ∼ Bin(n, θ) and θ ∼ Beta(α, β)

and the asymmetric absolute error loss:



⎨ k (θ − d) if θ>d
2
L(θ, d) = (4.58)
⎩ k (d − θ) if θ≤d
1

(a) Derive the Bayes estimator associated with the loss in (4.58).
(b) Set k1 = k2 . Give an approximate expression for the Bayes estimator.
(c) Discuss possible applications of this asymmetric approach. Give an ex-
ample.
(d) Is it possible to find a prior inside the class of beta distributions
{Beta(α, β) : α > 1, β > 1} such that the Bayes estimator has a constant
risk? Why is a constant risk of interest?
3. Consider X ∼ Gamma(α, β), where α > 2 is known. The parameter of
interest is θ = β.
(a) Derive Jeffreys prior and the related posterior.
(b) Derive the Bayes estimator δ π (x) under the L2 loss function.
(c) Calculate the frequentist risk, the posterior expected loss and the Bayes
risk of δ π (x).
(d) Show that δ π (x) is admissible.
4. Consider X ∼ Gamma(α, β), where α > 2 is known. The parameter of
interest is θ = β1 .
(a) Does Gamma(α, β) belong to an exponential family? Derive the natural
parameter and the related sufficient statistics.
(b) Derive the efficient estimator δ ∗ .
(c) Derive the conjugate prior for θ. Calculate the posterior.
112 DECISION THEORY
π
(d) Derive the Bayes estimator δ (x) under the L2 loss function.
(e) Calculate the frequentist risk of δ π and of δ ∗ , the posterior expected loss
of δ π and the Bayes risk of δ π .
(f) Is δ π admissible?
(g) Compare the frequentist risk of δ π and of δ ∗ .
5. Consider X ∼ Mult(n, θ1 , θ2 , θ3 ) and


n
θ = (θ1 , θ2 , θ3 ) ∈ Θ = {(ϑ1 , ϑ2 , ϑ3 ) : ϑi ∈ [0, 1], ϑi = 1}
i=1

with θ ∼ Dir(α1 , α2 , α3 ). Assume the loss L(θ, d) = (θ − d)T (θ − d).


(a) Derive the posterior distribution.
(b) Derive the Bayes estimator δ π (x) under the L2 loss function.
(c) Calculate the frequentist risk of the Bayes estimator.
(d) Show that the Bayes risk is finite.
(e) Determine the least favourable prior.
(f) Calculate the minimax estimator for θ.
Chapter 5

Asymptotic Theory

In this chapter we define and explain the meaning of asymptotic consistency


in a Bayes model. The reader not interested in asymptotic theory can jump
over this chapter. But we would like to give two important messages on the
way. If Bayesian methods should work in the long run then the least we need
is that
- the prior does not exclude any data generating parameter, and
- the parameters in P are identifiable.
The main result we present is Schwartz’ Theorem. It is a generalization of
Doob’s Theorem, which is based on measure theory especially on the martin-
gale method. Schwartz’ Theorem has an information–theoretic background.
We give the main steps of the proof because of its link to the reference priors.
In this textbook we want to avoid measure theory as much as possible, which
makes the presentation of results in this chapter essentially complicated. We
shall, therefore, try to be as precise as possible. For further reading we rec-
ommend Liese and Miescke (2008, Chapter 7, Section 7.5.2) and the original
papers Schwartz (1965) and Diaconis and Freedman (1986).

5.1 Consistency
Recall from Chapter 2 that, in a Bayes model {P, π}, the notion of a true
parameter makes no sense. The parameter is a random variable with known
distribution π. But given the data x we can ask for the data generating dis-
tribution. Like in decision theory we cross the bridge and consider a Bayes
procedure inside the model P = {Pθ : θ ∈ Θ} and study its consistency.
Having in mind the Bayesian inference principle, that all inference on θ is
based on the posterior distribution, we define consistency with respect to the
posterior distribution. Note that, we use the capital letter X(n) for the random
variable and x(n) for its realization, we set Pπ (.) for the prior distribution
with density π(.) and Pπn (.|x(n) ) for the posterior distribution with density
πn (.|x(n) ).

DOI: 10.1201/9781003221623-5 113


114 ASYMPTOTIC THEORY

Definition 5.1 (Strong Consistency) Assume a Bayes model


(n)
{P , π}. Assume that X(n) ∼ Pθ0 . Then a sequence of posteriors
(n)

πn (θ|X(n) ) is called strongly consistent at θ0 iff for every open subset


O ⊂ Θ with θ0 ∈ O it holds that

Pπn (O|X(n) ) → 1, as n → ∞ with probability 1. (5.1)

This definition is general and also includes dependent data. In case of an i.i.d.
sample we have X(n) = (X1 , . . . , Xn ) ∼ P⊗n
θ0 . The posterior distribution con-
tracts to the data generating parameter θ0 and becomes the Dirac distribution
at the point θ0 . We illustrate the definition for the binomial model.

Example 5.1 (Binomial distribution)


We consider the Bayes model X|θ ∼ Bin(n, θ) and θ ∼ Beta(α, β), so that
θ|X ∼ Beta(α + x, β + n − x), see Example 2.11. The expectation and the
variance of the posterior are
α+x α x n
E(θ|X = x) = = +
α+β+n α+β+n nα+β+n
(α + x)(β + n − x)
Var(θ|X = x) .
(α + β + n)2 (α + β + n + 1)

Since 0 ≤ x ≤ n, we have

(α + n)(β + n)
Eθ0 (Var(θ|X = x)) ≤ → 0, for n → ∞.
(α + β + n)2 (α + β + n + 1)

Thus the posterior concentrates more and more around


α x n
lim Eθ0 (E(θ|X = x)) = lim + lim Eθ = θ0 .
n→∞ n→∞ α + β + n n→∞ 0 n α+β+n
2

Definition 5.1 implies the consistency of Bayes estimators in the frequentist


model. First we recall the definition of a Bayes estimator, given in Chapter 4.
The loss function

L : (θ, d) ∈ Θ × Θ → L(θ, d) ∈ R+ ,

is the penalty for the choice of d, instead of θ. We assume the following contrast
condition, that there exists a constant c0 > 0 for all d, such that

d − θ0 c0 ≤ L(θ0 , d) − L(θ0 , θ0 ). (5.2)


CONSISTENCY 115
This condition implies that the loss function L(θ0 , .), as a function of d, has a
unique minimum at θ0 . Furthermore we require the condition that there exists
(n)
a constant K for all X(n) ∼ Pθ0 such that

L2 (θ, θ0 ) πn (θ|X(n) ) dθ ≤ K 2 a.s. (5.3)
Θ

The posterior expected loss is defined as



ρn (π, d|x(n) ) = L(θ, d)πn (θ|x(n) ) dθ.
Θ
In Theorem 4.1 it is shown that the Bayes estimator δnπ (x(n) ) fulfills the fol-
lowing condition. For every x(n) ∈ X (n) , δnπ (x(n) ) is given by

ρn (π, δnπ (x(n) )|x(n) ) = inf ρn (π, d|x(n) ).


d

Theorem 5.1
Assume the following conditions:
1. The loss function fulfills the conditions (5.2) and (5.3).
2. For all ε > 0 and all open sets O ⊂ Θ with θ0 ∈ O it holds for

Bε (θ0 ) = {θ; θ ∈ O, |L(θ, d) − L(θ0 , d)| < ε, for all d}

that
Pπ (Bε (θ0 )) > 0. (5.4)
(n)
3. Let X(n) ∼ Pθ0 and the sequence of posteriors πn (θ|X(n) ) be strongly
consistent at θ0 .
Then for n → ∞
δnπ (X(n) ) → θ0 a.s.

Proof: Set B = Bε (θ0 ) and B c = Θ \ B. It holds that



ρn (π, θ0 |x(n) ) ≥ ρn (π, δnπ (x(n) )|x(n) ) = L(θ, δnπ (x(n) ))πn (θ|x(n) ) dθ
 Θ

≥ π
L(θ, δn (x(n) ))πn (θ|x(n) ) dθ
B (5.5)
≥ (L(θ0 , δnπ (x(n) )) − ε)πn (θ|x(n) ) dθ
B
≥ (L(θ0 , δnπ (x(n) )) − ε) Pπn (B|x(n) ).
Otherwise we have

L(θ, θ0 )πn (θ|x(n) ) dθ ≤ (L(θ0 , θ0 ) + ε)Pπn (B|x(n) ).
B
116 ASYMPTOTIC THEORY
and because of (5.3) and the Cauchy–Schwartz inequality,
 2
L(θ, θ0 )πn (θ|x(n) ) dθ
c
B 
≤ L(θ, θ0 )2 πn (θ|x(n) ) dθ IB c (θ)πn (θ|x(n) ) dθ

≤ K 2 Pπn (B c |x(n) ).

Summarizing, we get
 
ρn (π, θ0 |x(n) ) = L(θ, θ0 )πn (θ|x(n) ) dθ + L(θ, θ0 )πn (θ|x(n) ) dθ
B Bc (5.6)
1
≤ (L(θ0 , θ0 ) + ε)Pπn (B|x(n) ) +K (Pπn (B c |x(n) )) 2 .

From (5.5) and (5.6) it follows that

(L(θ0 , δ π (x(n) )) − ε)Pπn (B|x(n) )


1
≤ (L(θ0 , θ0 ) + ε)Pπn (B|x(n) ) + K (Pπn (B c |x(n) )) 2

and 1
(Pπ (B c |x(n) )) 2
L(θ0 , δnπ (x(n) )) ≤ 2ε + L(θ0 , θ0 ) + K nπ .
Pn (B|x(n) )
From (5.4), B ⊂ O and the consistency of the posterior, it holds that
1
(Pπn (B c |x(n) )) 2
→ 0, a.s.
Pπn (B|x(n) )

Because of (5.2) we have L(θ0 , θ0 ) ≤ L(θ0 , δ π (x(n) )) and for ε → 0 we obtain

0 ≤ δnπ (x(n) ) − θ0 c0 < L(θ0 , δnπ (x(n) )) − L(θ0 , θ0 ) → 0 a.s.

2
The assumption (5.4) includes the requirement that the prior distribution
is positive around the data generating parameter θ0 . The following example
illustrates what happens when the prior excludes θ0 .

Example 5.2 (Counterexample)


Assume Pθ = N(θ, 1) and θ0 = 2. X1 , . . . , Xn are i.i.d from N(2, 1). The prior
is the uniform distribution over [−1, 1]. From Example 2.8, we obtain
n
πn (θ|x(n) ) ∝ π(θ)(θ|xn )(θ) ∝ exp − (x̄ − θ)2 I[−1,1] (θ).
2
Hence the posterior is a truncated normal distribution with μ = x̄ and σ 2 = n1
and boundaries a = −1, b = 1. The series of posteriors cannot contract around
CONSISTENCY 117

likelihood
prior

8
posterior, n=1
posterior, n=2
posterior, n=10
6
4
2
0

−2 −1 0 1 2 3

theta

Figure 5.1: Example 5.2. The posterior distributions are concentrated at the border
b = 1, while the data generating parameter is θ0 = 2.

θ0 = 2, see Figure 5.1. We apply a quadratic loss L(θ, d) = |θ − d|2 . The loss
fulfills conditions (5.2)and (5.3). From Theorem 4.2 we get

1 ϕ(α) − ϕ(β)
δnπ (x(n) ) = E(θ|x(n) ) = x̄ + √ ,
n Φ(β) − Φ(α)
√ √
where α = n(−1 − x̄), β = n(1 − x̄), ϕ and Φ are the density and dis-
tribution functions of N(0, 1). Note that, this expression cannot be used for
calculations, since the posterior support is lying in the tails of a normal dis-
tribution that implies the fraction of two nearly zero terms. From the Law of
Large Numbers we obtain that x̄ → 2 a.s. for n → ∞. Applying the rule of
L’ Hospitale we get
δnπ (x(n) ) − x̄ → −1 a.s.
2

As a consequence of this example we can formulate a general rule of thumb:

Never exclude any possible parameters from the prior!

Note that, Assumption (5.4) is stronger than that the prior covers all possi-
ble parameters. It also requires a continuity of the parametrization. Here we
present a counterexample from Bahadur; see Schwartz (1965).
118 ASYMPTOTIC THEORY
Example 5.3 (Bahadur’s counterexample)
Assume X1 , . . . , Xn are i.i.d from Pθ with θ ∈ Θ = [1, 2), where

⎨ U[0, 1] for θ = 1
Pθ =
⎩ U[0, 2 ] for 1 < θ < 2
θ

For y = xmax = maxi=1,...,n xi the likelihood function is




⎪ 1 for θ = 1 and y ≤ 1


n
θ
(θ|x(n) ) = for y2 > θ > 1 and y > 1

⎪ 2n

⎩ 0 else

and we obtain the maximum likelihood estimator



⎨ 1 for y ≤ 1
θM LE =
⎩ 2 for y > 1.
y

Consider now the Bayes model with uniform prior π(θ) = 1 for 1 ≤ θ < 2.
The Bayes rule under quadratic loss is the expectation of the posterior. We
have π(θ|x(n) ) ∝ (θ|x(n) ) and
2
θ (θ|x(n) ) dθ
δ(x(n) ) = 1 2
1
(θ|x(n) ) dθ
b
Applying the integral a
z m dz = m+11
(bm+1 − am+1 ), with a = 1 and b = 2
y
we obtain ⎧
⎨ n+1 2n+2 −1 for y ≤ 1
n+2 2n+1 −1
δ(x(n) ) = .
⎩ n+1 bn+2 −1 for y > 1, b = 2
n+2 b n+1 −1 y

For θ = 1 we have Pθ (y > 1) = 0 and θM LE = 1, but


 
n + 1 2n+2 − 1
lim δ(x (n) ) = lim = 2.
n→∞ n→∞ n + 2 2n+1 − 1

The maximum likelihood estimator is consistent while the Bayes estimator is


not. For illustration see Figure 5.2. 2

In Example 5.1 the limit behavior of the posterior is independent of the prior.
This can be proved generally. This property was also one of the key points for
the construction of the reference priors; see the discussion in Section 3.3.3.
CONSISTENCY 119
1.2

1.0
theta=1
theta=1.3
1.0

0.8
theta=1.9
0.8

0.6
0.6

0.4
0.4

0.2
0.2
0.0

0.0
0.0 0.5 1.0 1.5 2.0 1.0 1.2 1.4 1.6 1.8 2.0

x theta

Figure 5.2: Example 5.3. Left: densities for θ = 1 and θ = 1.3, 1.9. Right: likelihood
functions with xmax = 1.1 and with xmax = 0.8 (broken line). The maximum likeli-
hood estimator is xmax
2
≈ 1.82 in the first case and 1 in the second.

(n)
Theorem 5.2 Assume X(n) ∼ Pθ0 and two different priors π1 , π2 , both
continuous and positive at θ0 . Assume that for i = 1, 2 the posterior series
πi,n (θ|X(n) ), associated with πi , are strongly consistent. Then

sup |Pπn1 (A|X(n) ) − Pπn2 (A|X(n) )| → 0, for n → ∞, a.s. (5.7)


A

Proof: We have for i = 1, 2



μi (A)
Pπni (A|X(n) ) = with μi (A) = pn (x(n) |θ)πi (θ) dθ.
μi (Θ) A

Let O1 ⊃ O2 ⊃ . . . be a sequence of open balls with centers θ0 . Because of the


consistency of the posterior, for every given η > 1 there exist m, n0 = n0 (m)
such that for all n > n0
1
Pπni (Om |X(n) ) ≥ , i = 1, 2. (5.8)
η
The continuity of the priors implies the existence of a m0 such that for all
θ ∈ O m0
1 π1 (θ) π1 (θ0 )
α≤ ≤ η α, with α = . (5.9)
η π2 (θ) π2 (θ0 )
Using (5.9) we obtain

π1 (θ)
μ1 (Om0 ) = pn (x(n) |θ) π2 (θ) dθ ≤ η α μ2 (Om0 )
O m0 π2 (θ)
120 ASYMPTOTIC THEORY
and
1
μ1 (Om0 ) ≥ α μ2 (Om0 ).
η
Thus
1 μ1 (Om0 )
α≤ ≤ η α. (5.10)
η μ2 (Om0 )
We factorize
μ1 (Θ) μ1 (Θ) μ1 (Om0 ) μ2 (Om0 )
=
μ2 (Θ) μ1 (Om0 ) μ2 (Om0 ) μ2 (Θ)
and apply (5.8), (5.10) so that

1 μ1 (Θ)
α≤ ≤ η 2 α. (5.11)
η2 μ2 (Θ)
c
Set Om 0
= Θ \ Om0 . Using (5.11) we can estimate the posteriors

μ1 (A) μ2 (A)
Pπn1 (A|X(n) )−Pπn2 (A|X(n) ) = −
μ1 (Θ) μ2 (Θ)
μ1 (A ∩ Om0 ) μ2 (A ∩ Om0 ) μ1 (Om c
)
≤ − + 0

μ1 (Θ) μ2 (Θ) μ1 (Θ) (5.12)


μ2 (A ∩ Om0 ) μ1 (Om c
)
≤ (η 3 − 1) + 0

μ2 (Θ) μ1 (Θ)
≤ (η 3 − 1)Pπn2 (Om0 |X(n) ) + Pπn1 (Om
c
0
|X(n) ).

Since the posteriors are consistent, we have

lim Pπn2 (Om0 |X(n) ) = 1 a.s., lim Pπn1 (Om


c
0
|X(n) ) = 0 a.s.
n→∞ n→∞

The constant η > 1 is arbitrary; letting η → 1 we complete the proof.


2

5.2 Schwartz’ Theorem


Now we establish the theorem of Lorraine Schwartz, which gives general con-
ditions for posterior consistency.
Recall the Kullback–Leibler divergence K(Pθ0 |Pθ ) in (3.28), K(P|Q) =

p(x) ln( p(x)
q(x) )dx.
The other condition is related to the Hellinger transfom, defined by
 
H 12 (P, Q) = p(x)q(x) dx. (5.13)
SCHWARTZ’ THEOREM 121

Theorem 5.3 (Schwartz) Assume a Bayes model {P (n) , π} with


⊗n
P (n)
= {Pθ : θ ∈ Θ}. Let for all ε > 0,

Pπ (Kε (θ0 )) > 0, where Kε (θ0 ) = {θ : K(Pθ0 |Pθ ) < ε}. (5.14)

For every open set O ⊂ Θ with θ0 ∈ O and Oc = Θ\O, there exist constants
D0 and q0 , q0 < 1, such that

H 21 (P⊗n
θ0 , Pn,O ) ≤ D0 q0 ,
c
n
(5.15)

where Pn,Oc is defined by


 
π(θ)
Pn,Oc (A) = pn (x(n) |θ) dθ dx(n) . (5.16)
A Oc Pπ (Oc )

Then the sequence of posteriors is strongly consistent at θ0 .

Proof: We only provide a sketch of the proof. For a precise proof based on
measure theory we recommend Liese and Miescke (2008, p. 354ff). We have
to show that for an arbitrary open subset O with θ0 ∈ O

Pπn (Oc |X(n) ) → 0 for n → ∞ a.s. (5.17)

It holds that

μ(A)
Pπn (A|x(n) ) = with μ(A) = pn (x(n) |θ) π(θ) dθ. (5.18)
μ(Θ) A

We rewrite the fraction as


μ(Oc ) Numn (β)
Pπn (Oc |x(n) ) = = (5.19)
μ(Θ) Denn (β)

with
Numn (β) = exp(βn)μ(Oc ) pn (x(n) |θ0 )−1 ,
(5.20)
Denn (β) = exp(βn)μ(Θ) pn (x(n) |θ0 )−1 .

The proof is split into two steps. In the first step we show that under assump-
tion (5.15) there exists a β0 such that

lim sup Numn (β0 ) = 0 a.s. (5.21)


n→∞

In the second step we show that under (5.14), for all β > 0,

lim inf Denn (β) = ∞ a.s. (5.22)


n→∞
122 ASYMPTOTIC THEORY
Setting β = β0 in (5.22) we obtain (5.17). It remains to show (5.21) and (5.22).
We start with (5.21). We have
   12
π(θ)
H 12 (P⊗n
1
θ0 , Pn,O ) =
c pn (x(n) |θ0 ) 2 pn (x(n) |θ) dθ dx(n)
Oc Pπ (Oc )

1 1
= pn (x(n) |θ0 )g(x(n) ) 2 dx(n) = Eθ0 g(X(n) ) 2
(5.23)
with 
pn (x(n) |θ) π(θ)
g(x(n) ) = dθ. (5.24)
Oc pn (x(n) |θ0 ) Pπ (Oc )
From (5.15) it follows that
1
Eθ0 g 2 ≤ D0 q0n . (5.25)

Using (5.25) and the inequality for non-negative random variables Z that
P(Z ≥ 1) ≤ EZ, we obtain for arbitrary r > 1
1
P(r2n g(x(n) ) ≥ 1) = P(rn g(x(n) ) 2 ≥ 1)
1
(5.26)
≤ rn Eθ0 g 2 ≤ D0 (r q0 )n .
1
Choosing r such that 1 < r < q0 we get

 ∞

P(r g(x(n) ) ≥ 1) ≤ D0
2n
(r q0 )n < ∞. (5.27)
n=1 n=1

The Borel–Cantelli Lemma yields

P(r2n g(x(n) ) ≥ 1 infinitely often) = 0. (5.28)

We set β0 such that r = exp(β0 ), and get from (5.28)

Numn (β0 ) = r−n r2n g(x(n) ) Pπ (Oc ) ≤ r−n r2n g(x(n) ) ≤ r−n a.s.

Since r > 1 we get r−n → 0 and (5.21), which completes the first step of the
proof. To show (5.22), we set β = 2nε for arbitrary ε > 0. We have

pn (x(n) |θ)
lim inf Denn (β) = lim inf exp(2nε) π(θ) dθ
n→∞ n→∞ Θ n (x(n) |θ0 )
p

pn (x(n) |θ)
≥ lim inf exp(2nε) π(θ) dθ
n→∞ Kε (θ0 ) n (x(n) |θ0 )
p
   
pn (x(n) |θ0 )
= lim inf exp n2ε − ln π(θ) dθ
n→∞ K (θ )
ε 0
pn (x(n) |θ)
=D
(5.29)
SCHWARTZ’ THEOREM 123
P⊗n
(n)
As Pθ = θ , we have
 
pn (x(n) |θ0 )
ln = ln(pn (x(n) |θ0 )) − ln(pn (x(n) |θ))
pn (x(n) |θ)
n n
= ln(pn (xi |θ0 )) − ln(pn (xi |θ)) (5.30)
i=1 i=1
n  
pn (xi |θ0 )
= ln .
i=1
pn (xi |θ)

Applying Jensen’s inequality, that f (EX) ≤ E(f (x)) for convex functions f ,
and using the factorization

π(θ) = Pπ (Kε (θ0 ))πε (θ) with πε (θ) = π(θ|Kε (θ0 )) (5.31)

we obtain

1  pn (xi |θ0 )
n
D ≥ lim inf exp n 2ε −
ln π(θ) dθ
n→∞ Kε (θ0 ) n i=1 pn (xi |θ)
 (5.32)
1
n
≥ lim inf exp Pπ (Kε (θ0 )) n 2ε πε (θ) dθ − Yi ,
n→∞ Kε (θ0 ) n i=1

where  
p(xi |θ0 )
Yi = ln πε (θ) dθ.
p(xi |θ)
n
The strong law of large numbers yields n1 i=1 Yi → EY1 , a.s. for n → ∞. We
calculate EY1 ,

p(x1 |θ0 )
EY1 = ln( )πε (θ) dθ p(x1 |θ0 ) dx1
p(x1 |θ)

p(x1 |θ0 )
= ln( )p(x1 |θ0 ) dx1 πε (θ) dθ (5.33)
p(x1 |θ)

= K(Pθ0 |Pθ )πε (θ) dθ,

using Fubini’s Theorem.


From (5.14) it follows that
 
EY1 = K(Pθ0 |Pθ )πε (θ) dθ ≤ K(Pθ0 |Pθ )πε (θ) dθ ≤ ε Pπε (Kε (θ0 )) ≤ ε.
Kε (θ0 )

Summarizing, we obtain

lim inf Denn (β) ≥ lim inf exp (n Pπ (Kε (θ0 )) (2ε − ε)) = ∞
n→∞ n→∞

which gives (5.22) and completes the proof.


2
124 ASYMPTOTIC THEORY
We complete this chapter with a short discussion of the two main assumptions
in Schwartz’ Theorem.
The first main assumption (5.14) implies that the prior does not exclude the
data generating parameter θ0 ; see Example 5.2. Furthermore, a continuity of
the prior at θ0 with respect to the Kullback–Leibler divergence is required.
The second main assumption (5.15) is related to the Hellinger transform
(5.13). Recall the relation between Hellinger distance and Hellinger transform
in (4.32), so that

H2 (P⊗n c
⊗n
θ0 , Pn,O ) = 1 − H 12 (Pθ0 , Pn,O ) > 1 − D0 q0 .
c
n

It describes the ability to differentiate between P⊗n


θ0 and Pn,O , which can be
c

described by the existence of consistent tests. The following statement holds


now.

Theorem 5.4 Assume the model P (n) = {P⊗n


θ : θ ∈ Θ}. Let θ0 ∈ O,
O ⊂ Θ be open. Consider the testing problem:

H0 : P⊗n ⊗n
θ0 versus H1 : {Pθ : θ ∈ Θ \ O}

If there exists a sequence of nonrandomized tests ϕn and positive constants


C and β such that

Eθ0 ϕn (X(n) ) + sup Eθ (1 − ϕn (X(n) )) ≤ C exp(−nβ), (5.34)


θ∈Θ\O

then condition (5.15) is fulfilled in the Bayes model {{P⊗n


θ : θ ∈ Θ}, π}.

Proof: Set C1,n = {x(n) : ϕn (x(n) ) = 1}, the critical region of test ϕn , and
C0,n = {x(n) : ϕn (x(n) ) = 0} its complement. Recall the definition of Pn,Oc
in (5.16), and set πr (θ) for the prior restricted on Θ \ O. Then by Schwarz’
inequality we have
   12
H 12 (P⊗n
1
θ0 , Pn,O )
c = p(x(n) |θ0 ) 2 p(x(n) |θ0 )πr (θ) dθ dx(n)

≤ P⊗n ⊗n
1 1 1 1
θ0 (C1,n ) Pn,O (C1,n ) + Pθ0 (C0,n ) Pn,O (C0,n )
2 c 2 2 c 2

≤ P⊗n
1 1
θ0 (C1,n ) + Pn,O (C0,n ) .
2 c 2

Since
P⊗n
θ0 (C1,n ) = Eθ0 ϕn (X(n) )
LIST OF PROBLEMS 125
and
 
Pn,Oc (C0,n ) = p(x(n) |θ)πr (θ) dθ dx(n)
C0,n

≤ sup P⊗n
θ (C0,n )
θ∈Θ\O

≤ sup Eθ (1 − ϕn (X(n) ))
θ∈Θ\O

we obtain (5.15) from (5.34).


2
We recommend Liese and Miescke (2008, Section 7.5) for more results.

5.3 List of Problems


1. Consider an i.i.d. sample X(n) = (X1 , . . . , Xn ) from Exp (θ), where θ is the
rate parameter. Assume the prior θ ∼ Gamma(α, β).
(a) Derive the posterior distribution.
(b) Calculate the Bayes estimator δ(x(n) ) under the L2 loss.
(c) Show the consistency of δ(x(n) ).
2. Assume that Pθ belongs to an exponential family.
(a) Calculate the Kullback–Leibler divergence K(Pθ0 |Pθ ).
(b) Specify the result (a) for Gamma(α, β).
3. Assume that Pθ is the uniform distribution U(0, θ).
(a) Calculate the Kullback–Leibler divergence K(Pθ0 |Pθ ).
(b) Consider the Kullback–Leibler divergence as a function of θ. Is it con-
tinuous at θ0 ?
4. Consider an i.i.d. sample X(n) = (X1 , . . . , Xn ) from Gamma(ν, θ). The
parameter ν is known. Consider two different conjugate priors π1 , π2 :
θ ∼ Gamma(α1 , β), θ ∼ Gamma(α2 , β). Show that the respective poste-
riors fulfill (5.7). (Hint: Apply the Pinsker’s inequality ,
$
1
sup |P(A) − Q(A)| ≤ K(P|Q), (5.35)
A 2
and show that the Kullback–Leibler divergence converges to zero.)
5. Consider an i.i.d. sample X(n) = (X1 , . . . , Xn ) from N(θ, σ 2 ). The param-
eter σ 2 is known. Suppose θ ∼ N(μ, σ02 ). Show that the conditions (5.14)
and (5.15) of Theorem 5.3 are fulfilled.
6. Consider an i.i.d. sample X(n) = (X1 , . . . , Xn ) from N(a + b, 1). The pa-
rameter of interest is θ = (a, b). Suppose θ ∼ N2 (0, I2 ).
(a) Show that the Bayes estimator is not consistent.
(b) Show that the sequence of posteriors is not consistent.
(c) Show that the condition (5.15) is violated.
Chapter 6

Normal Linear Models

Linear models constitute an important part of statistical inference, with re-


gression and analysis of variance models as their two main components. Searle
(1971), later Searle and Gruber (2017), is a prominent classical reference. The
article Lindley and Smith (1972) set the stage for Bayesian theory of linear
models, although Zellner (1971) is an even older work. Other important ref-
erences include Box and Tiao (1973), Koch (2007) and Broemeling (2016).

This chapter deals with Bayesian theory of linear models, beginning with a
detailed treatment of univariate case, followed by a brief multivariate exten-
sion. Our focus will be on Bayesian analysis of linear models under normality
assumption, i.e., models of the form

P = {Nn (Xβ, Σ(ϑ)) : β ∈ Rp , Σ(ϑ)  0, ϑ ∈ Rq } , (6.1)

where X is a matrix of known constants and β is the unknown parameter


vector. Note that, ϑ is an intrinsic parameter, as variance component of the
model. Thus, we shall mostly have Σ(ϑ) = σ 2 Σ, so that ϑ = σ 2 , in which case
θ = (β, σ 2 ) and the entire parameter space for the model is Θ ⊂ Rp × R+ .

We recall that the linearity of such models follows from that of β in the
expectation, Xβ. We shall be mainly concerned with two Bayes models based
on (6.1), namely,
{P, πc } and {P, πJeff },
where πc and πJeff stand, respectively, for conjugate and Jeffreys priors; see
Chapter 3 for details. Under this setting, our aim will be to derive poste-
rior distributions, particularly focusing on closed-form expressions. The conju-
gate families, as the normal-inverse-gamma distributions or the normal-inverse
Wishart distributions, are well studied and computer-intensive methods are
not needed. Unfortunately, there is no standard parametrization for these dis-
tributions in the literature. We introduce the distributions in a general set
up in the running text and sign a gray frame around them for a better read-
ing. Furthermore these explicit posteriors give a chance to test the simulation
methods.

DOI: 10.1201/9781003221623-6 126


UNIVARIATE LINEAR MODELS 127
6.1 Univariate Linear Models
A univariate linear model can be stated, in matrix form, as

y = Xβ + , (6.2)

where y ∈ Rn is the vector of response variables, X ∈ Rn×p is the matrix of


known constants, β ∈ Rp is the vector of unknown parameters, and ∈ Rn is
the vector of unobservable random errors. Expanded in full form, model (6.2)
can be expressed as
⎛ ⎞ ⎛ ⎞⎛ ⎞ ⎛ ⎞
y1 x11 x12 . . . x1p β1 1
⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟
⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟
⎜ y2 ⎟ ⎜ x21 x22 . . . x2p ⎟ ⎜ β2 ⎟ ⎜ 2 ⎟
⎜ ⎟=⎜ ⎟⎜ ⎟ ⎜ ⎟
⎜ .. ⎟ ⎜ .. .. .. .. ⎟ ⎜ .. ⎟ + ⎜ .. ⎟ , (6.3)
⎜.⎟ ⎜ . . . . ⎟⎜ . ⎟ ⎜ . ⎟
⎝ ⎠ ⎝ ⎠⎝ ⎠ ⎝ ⎠
yn xn1 xn2 ... xnp βp n

where the first column of X is often a vector of ones, denoted 1n . Model (6.2)
expresses each of n observations in y as a linear combination of unknown
parameters in β with coefficients from X, i.e.,

p
yi = xTi β + i = xij βj + i , (6.4)
j=1

i = 1, . . . , n, where xi ∈ Rp is the ith row of X. First we illustrate model (6.2)


by several examples.

Example 6.1 (Corn plants)


To assess whether corn plants fetch their Phosphorus content from organic
or inorganic sources, concentrations for two types of organic (X1 , X3 ) and
an inorganic X2 source are measured (in ppm) on n = 17 soil samples, along
with the Phosphorus content in corn plants as study variable, Y . The data
are taken from Snedecor and Cochran (1989). The corresponding linear model
can be stated as

yi = β0 + β1 x1i + β2 x2i + β3 x3i + i , i = 1, . . . , n, (6.5)

with y ∈ R17 , X ∈ R17×4 (with first column of 1s for intercept) and β =


(β0 , β1 , β2 , β3 )T . It is a univariate multiple linear regression model with
three independent variables. Following (6.3), the design matrix and parameter
vector, for n = 17 and p = 4, are
⎛ ⎞ ⎛ ⎞
1 x11 x12 x13 β
⎜ ⎟ ⎜ 0⎟
⎜ ⎟ ⎜ ⎟
⎜1 x21 x22 x23 ⎟ ⎜β 1 ⎟
X=⎜ ⎜ .. .. ..
⎟ ⎜ ⎟
.. ⎟ , β = ⎜ ⎟ .
⎜. . . . ⎟ ⎜β 2 ⎟
⎝ ⎠ ⎝ ⎠
1 x17,1 x17,2 x17,3 β3
128 NORMAL LINEAR MODELS
20 30 40 50 60

30
20
x1

10
5
0
60
50

x2
40
30
20

160
120
x3

80
40
0 5 10 15 20 25 30 40 60 80 100 140

Figure 6.1: Matrix scatter plot of (X1 , X2 , X3 ) for corn plants.

Figure 6.1 depicts matrix scatter plot of independent variables. The model
was in fact employed to study two phosphorous contents, i.e., two depen-
dent variables. Here, we use only the first of them, adjourning the treat-
ment of bivariate model as a special case of multivariate linear model in
Section 6.4. 2

Example 6.2 (Side effects)


Certain medicines of tuberculosis (TB) are suspected to affect vision, often
leading to complete vision loss. Surprisingly, in some cases, the vision returns
to normal once the patient stops medicine.
To investigate the issue, a researcher takes a random sample of n patients,
who have been on TB medicine for at least six months, and measures each
patient’s visual acuity for 3 weeks. To account for individual parameters, four
concomitant variables are also recorded on each patient: weight, diastolic and
systolic blood pressure, and weekly amount of medicine used. As the vision in
both eyes deteriorate equally, if it does at all, it is decided to use the average
over both eyes as response variable. The linear model is then

yij = xTi β + wi + ij , j = 1, . . . , r, i = 1, . . . , a, (6.6)


UNIVARIATE LINEAR MODELS 129
where yij is the average vision of ith patient in jth week, wj represents the
week effect, and xTi β collects four concomitant variables. Model (6.6) consists
of all fixed effects components, the first being regression component and second
the ANOVA component, resulting into an analysis of covariance (ANOCOVA)
model, which is also a special case of model (6.2). The design matrix and
parameter vectors, in full form, can be written as
⎛ ⎞
x11 x12 x13 x14 1 0 0
⎜ ⎟ ⎛ ⎞
⎜ ⎟
⎜x11 x12 x13 x14 0 1 0⎟ β
⎜ ⎟ ⎜ 1⎟
⎜ ⎟ ⎜ ⎟
⎜x11 x12 x13 x14 0 0 1⎟ ⎜ β2 ⎟
⎜ ⎟ ⎜ ⎟
⎜ ⎟ ⎜ ⎟
⎜x21 x22 x23 x24 1 0 0⎟ ⎜ β3 ⎟
⎜ ⎟ ⎜ ⎟
⎜ ⎟ ⎜ ⎟
X = ⎜x21 x22 x23 x24 0 1 0⎟ , β = ⎜ β4 ⎟ .
⎜ ⎟ ⎜ ⎟
⎜ ⎟ ⎜ ⎟
⎜x21 x22 x23 x24 0 0 1⎟ ⎜w 1 ⎟
⎜ ⎟ ⎜ ⎟
⎜ ⎟ ⎜ ⎟
⎜x31 x32 x33 x34 1 0 0⎟ ⎜w 2 ⎟
⎜ ⎟ ⎝ ⎠
⎜ ⎟
⎜x31 x32 x33 x34 0 1 0⎟ w3
⎝ ⎠
x31 x32 x33 x34 0 0 1

The first part of β represents X variables on each patient, where the second
part represents the week effect, and these parts correspond, respectively, to
first four and last three columns of X. 2

Our main focus is on full rank models, so that r(X) = p and Σ(ϑ) = σ 2 Σ,
where Σ is positive definite and known. Under this set up, the likelihood and
log-likelihood functions for model (6.2) with θ = (β, σ 2 ) can be stated as

 
1 1 1 T −1
(θ|y) = exp − (y − Xβ) Σ (y − Xβ) (6.7)
(2πσ 2 )n/2 |Σ| 12 2σ 2
n 1 1
l(θ|y) = − ln(2πσ 2 ) − ln(|Σ|) − 2 (y − Xβ)T Σ−1 (y − Xβ),(6.8)
2 2 2σ
which yield the maximum likelihood estimator, equivalently also the general-
ized least-squares estimator (GLSE), of β,

βΣ = (XT Σ−1 X)−1 XT Σ−1 y. (6.9)

Using the linearity of βΣ and the properties of multivariate normal distribu-


tions, we get
βΣ ∼ Np (β, σ 2 (XT Σ−1 X)−1 ). (6.10)
For Σ = In , it reduces to the ordinary least-squares estimator (OLSE) of β,

β = (XT X)−1 XT y, β ∼ Np (β, σ 2 (XT X)−1 ). (6.11)


130 NORMAL LINEAR MODELS

90
Y (Length of Life, years)

80

70

60

50

40

6 7 8 9 10 11 12
X (Length of Lifeline, cm)

Figure 6.2: Life length data.

For a short introduction into linear models we refer to Liero and Zwanzig
(2011, Chapter 6).

Example 6.3 (Life length vs. life line)


Draper and Smith (1966) report an interesting data set, collected to study
a popular belief that the length of one’s life span can be predicted from the
length of life line on one’s left hand. The data consist of n = 50 pairs of
observations on the person’s age close to death, y, measured in years, and
the length of lifeline, X, measured in centimeter; for details and reference
to original study, see Draper and Smith (1966, p.105). After removing two
outlying observations, the data for n = 48 observations yield
⎛ ⎞ ⎛ ⎞
1.365 −0.148 3232
(X X) = ⎝
T −1 ⎠, X y = ⎝
 ⎠
−0.148 0.016 29282

so that β = (87.89 − 2.26), i.e., the least-squares fitted line predicting life
length using life line as predictor, is yi = 87.89 − 2.26xi . The fitted line is also
shown in Figure 6.2, along with the scatter plot of observed data.
Roughly, the negative slope seems to counter the popular belief. In fact, a test
of H0 : β1 = 0 could not be rejected at any reasonable α (F = 2.09, p-value
= 0.155), indicating that the belief is nothing more than a mere perception.
Later we analyze the same data under Bayes model. 2

R Code 6.1.3. Life data set, Figure 6.2.

library(aprean3)
#"Applied Regression Analysis" by N.R.Draper and H.Smith
life<-dse03v # life data: data(dse03v)
attach(life); age<-life$x; length<-life$y
BAYES LINEAR MODELS 131
plot(length,age)
# outlier age=19, length=13.20
length[45]; age[1] # outlier
# new data without outlier
Age<-c(age[2:44],age[46:50])
Length<-c(length[2:44],length[46:50])
plot(Length,Age)
M<-lm(Age~Length); abline(M); summary(M)

6.2 Bayes Linear Models


In this section, we consider Bayes model {P, π} with P as given in (6.1). For
convenience, we write
% &
P = N(Xβ, Σ) : β ∈ Rp , Σ ∈ Rn×n , Σ  0 . (6.12)

Here, π is a known prior distribution assigned to the parameters. For example,


for Σ = σ 2 I, with σ 2 known, θ = β and a prior is assigned to β; otherwise
for Σ = σ 2 I, with σ 2 unknown, θ = (β, σ 2 ) and a joint prior is assigned to β
and σ 2 . We shall mainly focus on two types of priors, conjugate and Jeffreys,
since in these cases, a closed-form expression of posterior distribution of the
parameter of interest can be derived.

In the following subsection we consider only the linear parameter as unknown.


We develop methods for handling the linear parameter which will also be ap-
plied in the more complex model where the linear parameter β and the covari-
ance parameter are unknown. Especially Lemmas 6.1, 6.2, 6.3, and Theorems
6.1, 6.2 provide a useful toolbox.

6.2.1 Conjugate Prior: Parameter θ = β, σ 2 Known


We assume that the matrix Σ in model (6.12) is completely known. The un-
known parameter is the linear regression parameter β. It is possible to apply
Theorem 3.3 to derive the conjugate family. Here we set as prior the normal
distribution Np (γ, Γ) and calculate the posterior. We assume that the hyper-
parameters, γ and Γ, are known. For details on hyperparameters, see Chapter
3.
The Bayes model {P, πc } specializes to

{{Nn (Xβ, Σ) : β ∈ Rp }, Np (γ, Γ)} .

We have
y|β ∼ Nn (Xβ, Σ) and β ∼ Np (γ, Γ). (6.13)
As the set up is all about joined, marginal and conditional normal distribution,
the following two lemmas deliver useful results for simplifying the calculations.
Lemma 6.1 states that if the joint distribution is multivariate normal, then
132 NORMAL LINEAR MODELS
all marginal and conditional distributions are also multivariate normal of re-
spective dimensions; see e.g., Mardia et al. (1979).

Lemma 6.1 Let X ∼ Np (μ, Σ), Σ  0, be partitioned so that


⎛ ⎞ ⎛⎛ ⎞ ⎛ ⎞⎞
X1 μ1 Σ11 Σ12
X = ⎝ ⎠ ∼ Np ⎝⎝ ⎠ , ⎝ ⎠⎠ ,
X2 μ2 Σ21 Σ22

where μ1 ∈ Rr , Σ11 ∈ Rr×r etc. Then

X1 ∼ Nr (μ1 , Σ11 )
X2 |X1 = x1 ∼ Np−r (μ2|1 , Σ2|1 ),

where

μ2|1 = μ2 + Σ21 Σ−1 −1


11 (x1 − μ1 ) and Σ2|1 = Σ22 − Σ21 Σ11 Σ12 .

The converse of Lemma 6.1 does not hold in general. However, a converse is
possible if E(X2 |X1 = x1 ) is a linear function of x1 and Cov(X2 |X1 = x1 )
does not even depend on x1 . As this result, unlike Lemma 6.1, is rarely found
in statistical literature, we state it below.

Lemma 6.2 It is given that

X1 ∼ Nr (μ1 , Σ11 )
X2 |X1 = x1 ∼ Np−r (Ax1 + b, Λ),

with A and b as a matrix and a vector of constants, respectively, and matrix


Λ is independent of x1 . Then
⎛ ⎞ ⎛⎛ ⎞ ⎛ ⎞⎞
X1 μ1 Σ11 Σ11 AT
X = ⎝ ⎠ ∼ N p ⎝⎝ ⎠,⎝ ⎠⎠ .
X2 Aμ1 + b AΣ11 Λ + AΣ11 AT

Proof: We write the joint density, using the given marginal and conditional
normals, as

f (X) = f (X1 )f (X2 |X1 )


1
= (2π)−p/2 |Σ11 |−1/2 |Λ|−1/2 exp(− (Q1 + Q2 ), (6.14)
2
BAYES LINEAR MODELS 133
where
Q1 = (X1 − μ1 )T Σ−1
11 (X1 − μ1 )
Q2 = (X2 − AX1 − b)T Λ−1 (X2 − AX1 − b).

Partitioning this sum, term by term, and collecting quadratic and bilinear,
linear, and constant terms together, we have

Q1 + Q2 = XT C−1 T −1 T −1
1 X − 2X C2 + μ1 Σ11 μ1 + b Λ
T
b,

where
⎛ ⎞ ⎛ ⎞
Σ−1 T −1
11 + A Λ A −AT Λ−1 Σ−1 T −1
11 μ1 − A Λ b
C−1
1 =⎝ ⎠ , C2 = ⎝ ⎠.
−Λ−1 A Λ−1 Λ−1 b

Now, recall that, a random vector X ∈ Rp with X ∼ Np (μ, Σ), must have the
density:  
1
f (X) ∝ exp − (XT Σ−1 X − 2XT Σ−1 μ)
2
Thus, we have to show, that C1 = Σ and C2 = Σ−1 μ. We use the inverse of
a partitioned matrix as given in Seber (2008), as
⎛ ⎞−1 ⎛ ⎞
A−1 −1
11 + A11 A12 S
−1
A21 A−1 −A−1
11 A12 S
−1
A11 A12
⎝ 11 ⎠ =⎝ ⎠,
−S −1
A21 A−1
11 S −1
A21 A22

where S = A22 − A21 A−1


11 A12 is the Schur complement of A11 , assuming
A11  0. Applying this to C1 and comparing the blocks deliver Λ = S,
A11 = Σ11 and A−1
11 A12 = A. It follows that
⎛ ⎞
Σ11 Σ11 AT
C1 = ⎝ ⎠ = Σ,
AΣ11 Λ + AΣ11 AT

as required. Comparison of the linear components yields


⎛ ⎞⎛ ⎞ ⎛ ⎞
Σ11 Σ11 AT Σ−1 μ − A T −1
Λ b μ
C1 C2 = ⎝ ⎠ ⎝ 11 1 ⎠=⎝ ⎠,
1
T −1
AΣ11 Λ + AΣ11 A Λ b Aμ1 + b

the required mean vector.


2
The Bayes model in (6.13) fulfills the assumptions of Lemma 6.2. The data
generating distribution is normal with mean which is linear in β and a covari-
ance which is independent of β. We obtain the following theorem on the joint
distribution of β and y.
134 NORMAL LINEAR MODELS

Theorem 6.1 For

β ∼ Np (γ, Γ) and y|β ∼ Nn (Xβ, Σ)

the joint distribution of β and y is given as


⎛ ⎞ ⎛⎛ ⎞ ⎛ ⎞⎞
β γ Γ ΓXT
⎝ ⎠ ∼ Nn+p ⎝⎝ ⎠ , ⎝ ⎠⎠ . (6.15)
y Xγ XΓ Σ + XΓXT

In the next step we apply Lemma 6.1 to the joint multivariate normal distri-
bution in (6.15) and obtain the following theorem.

Theorem 6.2 For

β ∼ Np (γ, Γ) and y|β ∼ Nn (Xβ, Σ)

it holds that
 
y ∼ Nn (μy , Σy ) and β|y ∼ Np μβ|y , Σβ|y ,

with

μy = Xγ (6.16)
T
Σy = Σ + XΓX (6.17)
T −1
μβ|y = T
γ + ΓX (Σ + XΓX ) (y − Xγ) (6.18)
T −1
Σβ|y = Γ − ΓX (Σ + XΓX )
T
XΓ. (6.19)

Recall that, the posterior moments in (6.18)-(6.19) are expressed in terms of


prior covariance matrix Γ, so that Γ need not be invertible, rather the n × n
matrix XΓXT + Σ is required to be so, which suffices to keep the posterior
non-degenerate.

We also note, however, that the posterior moments often provide a better
insight when expressed in terms of precision matrices, Σ−1 and Γ−1 . The
following lemma on special matrix inverse identities will be useful in achieving
these objectives.
BAYES LINEAR MODELS 135

Lemma 6.3 Let A ∈ Rm×m and B ∈ Rn×n be non-singular matrices, and


C ∈ Rm×n and D ∈ Rn×m be any two matrices such that A + CBD is
non-singular. Then

(A + CBD)−1 = A−1 − A−1 C(B−1 + DA−1 C)−1 DA−1 . (6.20)

In particular,

(Im + CCT )−1 = Im − C(In + CT C)−1 CT (6.21)


T −1
(Im + CC ) C = C(In + CT C)−1 . (6.22)

Proof: The proof follows by showing that KK−1 = I for K = A + CBD.


Set
M = B−1 + DA−1 C.
We have
 
KK−1 = (A + CBD) A−1 − A−1 CM−1 DA−1
= I − CM−1 DA−1 + CBDA−1 − CBDA−1 CM−1 DA−1
 
= I − CB B−1 − M + DA−1 C M−1 DA−1
= I − CB (M − M) M−1 DA−1 = I.

If A = Im , B = In , and D = CT , then (6.20) reduces to (6.21). Further we


use (6.21) to show (6.22):

(Im + CCT )−1 C = C − C(In + CT C)−1 CT C


 
= C(In + CT C)−1 In + CT C − CT C
= C(In + CT C)−1 .
2
Following corollary to Theorem 6.2 utilizes the identities in Lemma 6.3 to
re-write the posterior moments in terms of Σ−1 and Γ−1 .

Corollary 6.1 Assuming Σ ∈ Rn×n and Γ ∈ Rp×p non-singular, the pos-


terior moments in Theorem 6.2 can be re-formulated as

μβ|y = Σβ|y (XT Σ−1 y + Γ−1 γ) (6.23)


−1 T −1 −1
Σβ|y = (Γ +X Σ X) . (6.24)
136 NORMAL LINEAR MODELS
Proof: We obtain (6.24) by applying (6.20) to (6.19) with A = Γ, C =
DT = X and B = Σ−1 . Now we show (6.23). From (6.18) we have

μβ|y = γ + ΓXT (XΓXT + Σ)−1 (y − Xγ)


= m1 + m2

with
 −1
m1 = γ − ΓXT XΓXT + Σ Xγ
 
= Γ − ΓX (XΓX + Σ) XΓ Γ−1 γ
T T −1

= Σβ|y Γ−1 γ

and
 −1
m2 = ΓXT XΓXT + Σ y
−1
= Γ 2 Γ 2 XT Σ− 2 Σ− 2 XΓXT Σ− 2 + In Σ− 2 y
1 1 1 1 1 1

1  −1 − 1
= Γ 2 C CT C + In Σ 2 y,

where C = Γ 2 XT Σ− 2 . Applying (6.22) gives


1 1

1  −1
CΣ− 2 y
1
m2 = Γ 2 Im + CCT
−1
= Γ 2 Im + Γ 2 XT Σ−1 XΓ 2 Γ 2 XT Σ−1 y
1 1 1 1

 −1 T −1
= Γ−1 + XT Σ−1 X X Σ y
= Σβ|y XT Σ−1 y.

Finally, substituting m1 and m2 in μβ|y gives (6.23).


2
Note that, we have

Cov(β) = Γ and Cov(β|y) = (Γ−1 + XT Σ−1 X)−1 .

It holds that
Cov(β)  Cov(β|y)
−1
since X Σ X  0. This implies that the posterior not only updates the
T

prior, using the information in data, rather also improves it by reducing its
variance.
Before we present special cases of (6.23) and (6.24) we show an alternative
method of their derivation.
Following the general notion (see Chapter 2), the posterior distribution of β|y
can be obtained as
π(β|y) ∝ π(β)(β|y).
BAYES LINEAR MODELS 137
For the set up in (6.13), the prior and the likelihood can be written, respec-
tively, as
 
1 1 T −1
π(β) = exp − (β − γ) Γ (β − γ)
(2π)p/2 |Γ|1/2 2
 
1 1 T −1
(θ|y) = exp − (y − Xβ) Σ (y − Xβ) ,
(2π)n/2 |Σ|1/2 2

where γ, Γ, and Σ are assumed known. Thus


 
1
π(β|y) ∝ exp − (Q1 + Q2 )
2

with
Q1 = (β − γ)T Γ−1 (β − γ)
Q2 = (y − Xβ)T Σ−1 (y − Xβ).

Consider Q1 + Q2 as function of β and collecting the quadratic, linear, and


constant terms together we obtain

Q1 + Q2 = β T (Γ−1 + XT Σ−1 X)β − 2β T (Γ−1 γ + XT Σ−1 )y + const1


= β T Γ−1 T −1
1 β − 2β Γ1 γ1 + const1

with
Γ1 = (Γ−1 + XT Σ−1 X)−1
γ1 = Γ1 (Γ−1 γ + XT Σ−1 y)
const1 = γ T Γ−1 γ + yT Σ−1 y.

Completing the squares by Q3 = γ1T Γ−1


1 γ1 , we obtain

Q1 + Q2 = β T Γ−1 T −1 T −1
1 β − 2β Γ1 γ1 + γ1 Γ1 γ1 + const1 − Q3
= (β − γ1 )T Γ−1
1 (β − γ1 ) + const2

where
const2 = γ T Γ−1 γ + yT Σ−1 y − γ1T Γ−1
1 γ1 .

The constant can also be re-written as

const2 = (y − Xγ1 )T Σ−1 (y − Xγ1 ) + (γ − γ1 )T Γ−1 (γ − γ1 ).

To see this we expand the two quadratic forms above as follows

yT Σ−1 y + γ T Γ−1 γ + Q4 .
138 NORMAL LINEAR MODELS
with
Q4 = −2γ1T XT Σ−1 y + γ1T XT Σ−1 Xγ1 − 2γ1 Γ−1 γ + γ1 Γ−1 γ1
= −2γ1T (XT Σ−1 y + Γ−1 γ) + γ1T (Γ−1 + XT Σ−1 X)γ1
= −2γ1T Γ−1 T −1
1 γ1 + γ1 Γ1 γ1
= −γ1T Γ−1 γ1
which gives the formula of const2 . Summarizing we obtain for the posterior
 
1
π(β|y) ∝ exp − (β − γ1 )T Γ−11 (β − γ 1 ) .
2
This is the kernel of Np (γ1 , Γ1 ). Comparing the formulas for γ1 and Γ1 with
(6.23) and (6.24) we obtain again

β|y ∼ Np (μβ|y , Σβ|y ).

For later application we summarize the calculations in the following lemma.

Lemma 6.4 It holds that

Q = (β − γ)T Γ−1 (β − γ) + (y − Xβ)T Σ−1 (y − Xβ)

can be expressed as

Q = (β −γ1 )T Γ−1 T −1
1 (β −γ1 )+(y−Xγ1 ) Σ (y−Xγ1 )+(γ −γ1 )T Γ−1 (γ −γ1 )
(6.25)
or alternatively as

Q = (β − γ1 )T Γ−1 T −1
1 (β − γ1 ) + y Σ y + γ T Γ−1 γ − γ1T Γ−1
1 γ1 (6.26)

with
γ1 = Γ1 (Γ−1 γ + XT Σ−1 y).
Γ1 = (Γ−1 + XT Σ−1 X)−1

6.2.1.1 Special Cases


We now explore a few special cases. To begin with, we assume that the errors
2
1 , . . . , n in (6.4) are i.i.d. N1 (0, σ ) distributed, such that for in (6.2)

∼ Nn (0, σ 2 In ),

which implies Cov(y) = σ 2 In . Note that, structures like σ 2 In are called spher-
ical because the level sets of the density are spheres. In this case, the posterior
BAYES LINEAR MODELS 139
distribution is N(μβ|y , Σβ|y ) with
 −1  
μβ|y = Γ−1 + σ −2 XT X σ −2 XT y + Γ−1 γ (6.27)
 −1
Σβ|y = Γ−1 + σ −2 XT X . (6.28)

Another special case is that each component of β has independent prior in-
formation with same precision, i.e., we assume

β ∼ Np (γ, τ 2 Ip ).

Then the moments of the posterior further simplify to


 −1  
μβ|y = τ −2 Ip + σ −2 XT X XT y + τ −2 γ (6.29)
 −1
Σβ|y = τ −2 Ip + σ −2 XT X . (6.30)

If we additionally assume that the design matrix X is orthogonal, such that


XT X = η 2 Ip , then the maximum likelihood estimator is given by

β = (XT X)−1 XT y = η −2 XT y.

We can express the posterior moments as


 −1  −2 T 
μβ|y = τ −2 Ip + σ −2 η 2 Ip σ X y + τ −2 γ
1
= η 2 β + γσ 2 /τ 2 (6.31)
σ /τ + η 2
2 2
 −2 −1
Σβ|y = τ Ip + σ −2 η 2 Ip
σ2
= Ip . (6.32)
σ 2 /τ 2 + η 2

The expectation of the posterior in (6.31) is the convex combination of the


maximum likelihood estimator and the prior expectation.
In particular, we notice that the prior precision and the posterior precision
coincide if we let σ 2 → ∞, keeping τ 2 and η 2 fixed, i.e., if we make the
data sacrifice its precision completely. Note also that, if we let τ 2 → ∞, i.e.,
the prior becomes non-informative, then the posterior precision reduces to
η 2 σ −2 Ip .

Example 6.4 (Life length vs. life line)


Here, we analyze the data for Example 6.3 under the model

yi = α + βxi + εi , i = 1, . . . , 48, εi ∼ N(0, σ 2 ) i.i.d. with σ 2 = 160, (6.33)


140 NORMAL LINEAR MODELS
using two conjugate priors for θ = (α, β), namely
⎛ ⎞ ⎛⎛ ⎞ ⎛ ⎞⎞
α 0 100 0
⎝ ⎠ ∼ N 2 ⎝⎝ ⎠ , σ ⎝ 2 ⎠⎠
β 0 0 100

and ⎛ ⎞ ⎛⎛ ⎞ ⎛ ⎞⎞
α 100 100 0
⎝ ⎠ ∼ N2 ⎝⎝ ⎠ , σ 2 ⎝ ⎠⎠ .
β 0 0 100
We apply Corollary 6.1 and obtain, for the first prior, the posterior
⎛ ⎞ ⎛⎛ ⎞ ⎛ ⎞⎞
α 86.69 215.40 −23.30
⎝ ⎠ |y ∼ N2 ⎝⎝ ⎠,⎝ ⎠⎠
β −2.13 −23.30 2.56

and for the second prior, the posterior


⎛ ⎞ ⎛⎛ ⎞ ⎛ ⎞⎞
α 88.04 215.40 −23.30
⎝ ⎠ |y ∼ N2 ⎝⎝ ⎠,⎝ ⎠⎠ .
β −2.27 −23.30 2.56

The respective marginal posteriors of the slope are

β|y ∼ N(−2.13, 2.56) and β|y ∼ N(−2.27, 2.56).

We see that the choice of prior of the intercept has an influence on the posterior
expectation of the slope. We observe much reduced variance in the posterior
of the slope compared with the prior. The posterior variance of the intercept
in model (6.33) is still high.
Because we are mostly interested in the slope, we center the response and the
X–variable. The new model is:

yi − ȳ = β(xi − x̄) + εi , i = 1, . . . , 48, εi ∼ N(0, σ 2 ) i.i.d., σ 2 = 160. (6.34)

We assume the prior β ∼ N(0, σ 2 100) and obtain β|y ∼ N(−2.257, 2.59). One
of posteriors of the slope is plotted in Figure 6.3. 2

R Code 6.2.4. Life data, Example 6.4.

library(matrixcalc)
sigma<-sqrt(160) # variance known
# alpha~N(a,sa^2); beta~N(b,sb^2) # prior
# First case
a<-0 # prior mean intercept
BAYES LINEAR MODELS 141

0.25
0.20
0.15
0.10
0.05
0.00

−5 0 5

slope

Figure 6.3: Example 6.4 on life length data. The broken line is the prior density,
N(0, σ 2 100), of the slope, and the continuous curve is the respective posterior density.

sa<-sigma*10 # prior sd intercept


b<-0# prior mean slope
sb<-sigma*10 # prior sd slope
Gamma<-matrix(c(sa^2,0,0,sb^2),ncol=2) # prior covariance
solve(Gamma) # inverse matrix
L<-Length;
XX<-matrix(c(48,sum(L),sum(L),sum(L*L)),ncol=2)
P<-XX*(sigma)^(-2)+solve(Gamma)## precision
Gamma1<-solve(P) # posterior covariance
xy<-c(sum(Age),sum(Age*L))
bpost<-Gamma1%*%(sigma^(-2)*xy+solve(Gamma)%*%c(a,b))
bpost # posterior mean of intercept and slope
# centered model
Agec<-Age-mean(Age); y<-Agec
Lengthc<-L-mean(L); x<-Lengthc
# Bayes calculation in centered model
bb<-0 # prior mean
varbb<-100*160 # prior variance
Gamma1c<-(varbb^(-1)+(sigma)^(-2)sum(x*x))^(-1)
Gamma1c # posterior variance
gamma1bb<-Gamma1c*(sigma^(-2)*sum(x*y)+bb*varbb^(-1))
gamma1bb # posterior mean
142 NORMAL LINEAR MODELS
Example 6.5 (Corn plants)
Recall Example 6.1 on corn data. We assume that all parameters are indepen-
dent and have the same prior N(0, 10). The errors in (6.5) are independent
normally distributed with σ 2 = 300. We apply Corollary 6.1. The resulting
posterior distribution of β = (β0 , β1 , β2 , β3 ) is β|y ∼ N4 (μβ|y , Σβ|y ) with
⎛ ⎞ ⎛ ⎞
2.22 9.65 −0.01 −0.12 −0.03
⎜ ⎟ ⎜ ⎟
⎜ ⎟ ⎜ ⎟
⎜1.34⎟ ⎜−0.01 0.23 −0.07 0.004 ⎟
μβ|y = ⎜⎜
⎟ , Σβ|y = ⎜
⎟ ⎜
⎟.

⎜0.65⎟ ⎜−0.12 −0.07 0.09 −0.02⎟
⎝ ⎠ ⎝ ⎠
0.24 −0.03 0.004 −0.02 0.007

The prior and marginal posterior distributions of the three components of β


are depicted in Figure 6.4. 2

We finish this section by discussing a posterior distribution for a one-way


ANOVA model, followed by two related examples. For simplicity, we only
focus on fixed effects models, i.e.,

yij = μi + ij , ij ∼ N(0, σ 2 ) (6.35)


4

prior = N(0,10)
beta 1
beta 2
3

beta 3
2
1
0

−1 0 1 2 3

parameter

Figure 6.4: Example 6.5. Prior and posterior distributions of β1 , β2 , β3 for corn data
with conjugate prior.
BAYES LINEAR MODELS 143
where yij are independent observations measured on jth subject in ith group,
j = 1, . . . , ri , i = 1, . . . , a. It holds that yij ∼ N(μi , σ 2 ), hence


ri
y i ∼ N(μi , σ 2 /ri ) where y i = yij /ri .
j=1

We let μi ∼ N(γi , τ 2 ) be the prior, independent for each μi and assume γi and
τ 2 , along with σ 2 , known. Model (6.35) consists of a independent samples.
For known variance σ 2 the statistic y i is sufficient in each sample. Following
theorem gives the posterior distribution for each group.

Theorem 6.3 Given  y i ∼ N(μi , σ 2 /ri ) with known σ 2 and the prior μi ∼
ri
N(γi , τ 2 ), where y i = j=1 yij /ri , i = 1, . . . , a. The posterior distribution
of μi |y i follows for each group i as

μi |y i ∼ N(Ai , m2i ) (6.36)

where
y i τ 2 + γi σ 2 /ri τ 2 σ 2 /ri
Ai = 2 2
and m2i = 2 . (6.37)
τ + σ /ri τ + σ 2 /ri

Proof: Apply the results of Example 2.12.


2
In the following, we apply Theorem 6.3 on two examples, the data sets of
which are taken from Daniel and Cross (2013, Chapter 8).

Example 6.6 (Parkinson)


The aim of the experiment is to assess the effects of weights on postural hand
tremor during self-feeding in patients with Parkinson’s disease. A random sam-
ple of n = 48 patients is divided into a = 3 groups of r = 16 each. The three
groups pertain to three different conditions, namely holding a built-up spoon
(108 grams), holding a weighted spoon (248 grams), and holding a built-up
spoon while wearing a weighted wrist cuff (470 grams). The amplitude of the
tremor is measured on each patient (in mm).

Denoting the amplitude measured on jth patient under ith condition as yij ,
the model can be stated as

yij = μi + ij , i = 1, 2, 3, j = 1, . . . , 16.
144 NORMAL LINEAR MODELS

π(μi)
π(μ1 | y1)
π(μ2 | y2)
yi

π(μi)
π(μ3 | y3)

1 2 3
i μi

Figure 6.5: Illustration for Example 6.6. Left: Bar plot of the Parkinson data. Right:
Prior and posterior distributions of μi .

The three sample means, y i , are 0.495, 0.575, 0.535, with the corresponding
variances as 0.144, 0.234 and 0.208. The left panel of Figure 6.5 shows the
basic statistics as bar plots.

For the analysis, we use μi ∼ N(0.5, 0.3) as prior and take ij ∼ N(0, 0.2).
This leads to the posterior distributions as N(0.496, 0.012), N(0.574, 0.012)
and N(0.534, 0.012), respectively for μ1 , μ2 and μ3 . The prior and posteriors
are plotted in the right panel of Figure 6.5. 2

Example 6.7 (Smokers)


The data consist of serum concentration of lipoprotein on a random sample of
r = 7 individuals classified under each of a = 4 groups as non-smokers, light
smokers, moderate smokers and heavy smokers. The objective of the study is
to see if the average serum concentration differs across four groups.

The model for concentration measured on jth subject in ith group is stated
as
yij = μi + ij , i = 1, . . . , 4, j = 1, . . . , 7.
The sample means and variances of the four groups are, respectively, (10.857,
8.286, 6.143, 3.286) and (2.476, 2.571, 2.810, 3.238). The downward trend of
the averages across four groups and almost identical variances are also clear
from the bar plot of the data shown in the left panel of Figure 6.6.

Using μi ∼ N(5, 2.5) as prior for each group and ∼ N(0, 2.5), the posterior
distributions as obtained as N(10.125, 0.313), N(7.875, 0.313), N(6, 0.313) and
N(3.5, 0.313), respectively for μ1 , μ2 , μ3 and μ4 , and the same are plotted,
along with the prior, in the right panel of Figure 6.6. 2
BAYES LINEAR MODELS 145

π(μi)
π(μ1 | y1)
yi

π(μ2 | y2)

π(μi)
π(μ3 | y3)
π(μ4 | y4)

1 2 3 4
i μi

Figure 6.6: Example 6.7. Left: Bar plot of smokers data. Right: Prior and posterior
distributions of μi .

6.2.2 Conjugate Prior: Parameter θ = (β, σ 2 )


Consider model (6.12) where the covariance matrix is known up to a scaling
parameter. We assume the model
% &
P = Nn (Xβ, σ 2 Σ) : β ∈ Rp , σ 2 ∈ R, σ 2 > 0 . (6.38)

The unknown parameter θ consists of the linear regression parameter β and


the scalar parameter σ 2 . The design matrix X ∈ Rn×p and the positive definite
matrix Σ ∈ Rn×n are known. This model belongs to an exponential family
of order p + 1. It is possible to apply Theorem 3.3 for deriving the conjugate
family. Here we take the direct way. First we introduce the distribution family
from which we chose the prior and later we show that the posterior also belongs
to the same distribution.
We already know that the conjugate prior of β given σ 2 is a normal distribu-
tion. Further we know that the conjugate prior of σ 2 in a normal model with
known expectation is an inverse–gamma distribution, see Example 2.14. The
conjugate family for the joint parameter θ = (β, σ 2 ) is a combination of both.
We assume that β|σ 2 ∼ Np (γ, σ 2 Γ) and σ 2 ∼ InvGamma(a/2, b/2). We write
the entire set up hierarchically as

y|β, σ 2 ∼ Nn (Xβ, σ 2 Σ)
β|σ 2 ∼ Np (γ, σ 2 Γ) (6.39)
σ 2
∼ InvGamma(a/2, b/2).

Note that, β and σ 2 are not independent. The joint distribution of θ = (β, σ 2 )
is called normal-inverse-gamma (NIG) distribution . We introduce the distri-
bution in a general setting as follows.
146 NORMAL LINEAR MODELS

A random vector X ∈ Rp and a positive random scalar λ ∈ R+ have


jointly a normal-inverse-gamma (NIG) distribution

(X, λ) ∼ NIG(α, β, μ, Σ)

iff
X|λ ∼ Np (μ, λΣ) and λ ∼ InvGamma(α/2, β/2) (6.40)
with density
 
− p+α+2 1  T −1

f (X, λ) = c λ 2exp − β + (X − μ) Σ (X − μ) .

(6.41)
The constant c is given by
α
1 b2
c= α . (6.42)
(2π)p/2 |Σ|1/2 2 2 Γ( α2 )

The Bayes model {P, πc } specializes to


% &
{ Nn (Xβ, σ 2 Σ) : β ∈ Rp , σ 2 ∈ R+ }, NIG(a, b, γ, Γ) . (6.43)

The following theorem shows that a normal-inverse-gamma prior leads to a


normal-inverse-gamma posterior with updated parameters.

Theorem 6.4 For the Bayes model (6.43), the joint posterior distribution
of (β, σ 2 ) is the normal-inverse-gamma (NIG) distribution,

(β, σ 2 )|y ∼ NIG(a1 , b1 , γ1 , Γ1 ), (6.44)

with
a1 = a + n
b1 = b + (y − Xγ1 )T Σ−1 (y − Xγ1 ) + (γ − γ1 )T Γ−1 (γ − γ1 )
(6.45)
γ1 = Γ1 (XT Σ−1 y + Γ−1 γ)
Γ1 = (XT Σ−1 X + Γ−1 )−1 .
BAYES LINEAR MODELS 147
Proof: Consider prior in (6.43). Taking the likelihood function,
 
1 1 −1
(y|β, σ 2 ) = exp − (y − Xβ)Σ (y − Xβ) ,
(2π)n/2 (σ 2 )n/2 |Σ|1/2 2σ 2

into account, the posterior can be written as


 
1 1
π(β, σ |y) ∝
2
exp − 2 (Q + b) ,
(σ 2 )(n+p+a+2)/2 2σ

with
Q = (y − Xβ)T Σ−1 (y − Xβ) + (β − γ)T Γ−1 (β − γ).
Applying Lemma 6.4, (6.25) delivers the statement.
2
The posterior moments in Theorem 6.4 include the inverse matrices of Σ and
Γ. In case σ 2 is given, two different representations of the posterior moments
follow from Theorem 6.2 and Corollary 6.1. The formulas in (6.45) are related
to Corollary 6.1. We derive the presentations of the posterior moments related
to Theorem 6.4 for θ = (β, σ 2 ) in the following corollary. One advantage of
the new formulas is that the inverses of Σ and Γ are not needed.

Corollary 6.2 For the Bayes model (6.43), the joint posterior distribution
of (β, σ 2 ) is the normal-inverse gamma (NIG) distribution,

(β, σ 2 )|y ∼ NIG(a1 , b1 , γ1 , Γ1 ), (6.46)

with
a1 = a + n
b1 = b + (y − Xγ)T (Σ + XΓXT )−1 (y − Xγ)
(6.47)
γ1 = γ + ΓXT (Σ + XΓXT )−1 (y − Xγ)
Γ1 = Γ − ΓXT (Σ + XΓXT )−1 XΓ.

Proof: The formulas for γ1 and Γ1 in (6.45) are the same as in Corollary
6.1. We can apply Theorem 6.2. It remains to check the expression of b1 . Set

M = Σ + XΓXT . (6.48)
148 NORMAL LINEAR MODELS
−1
We have to show that Q = (y − Xγ) M T
(y − Xγ) = Q1 + Q2 , where

Q1 = (y − Xγ1 )T Σ−1 (y − Xγ1 ), Q2 = (γ − γ1 )T Γ−1 (γ − γ1 ).

We start with Q2 and apply

γ1 = γ + ΓXT M−1 (y − Xγ) (6.49)

thus
Q2 = (y − Xγ)T M−1 XΓXT M−1 (y − Xγ). (6.50)
Using (6.49) and (6.48), we get

y − Xγ1 = y − Xγ − XΓXT M−1 (y − Xγ)


= (M − XΓXT )M−1 (y − Xγ)
= ΣM−1 (y − Xγ),

and
Q1 = (y − Xγ)T M−1 ΣM−1 (y − Xγ).
Applying (6.48) we obtain

Q = Q1 + Q2
= (y − Xγ)T M−1 (Σ + XΓXT )M−1 (y − Xγ)
= (y − Xγ)T M−1 (y − Xγ),

which completes the proof.


2
2 2
Theorem 6.4 gives the joint posterior distribution of θ = (β, σ ). If σ is
the parameter of interest and β is the nuisance parameter, then the condi-
tional distribution of σ 2 given y is the most important. By the hierarchical
construction of the normal-inverse-gamma distribution NIG(a1 , b1 , γ1 , Γ1 ) we
know that
σ 2 |y ∼ InvGamma(a1 /2, b1 /2).
In case β is the parameter of interest and σ 2 is the nuisance parameter we are
interested in the marginal posterior distribution of β|y. The following general
lemma gives the marginal distributions of a normal-inverse-gamma distribu-
tion. Before we state it, we introduce the multivariate t-distribution.
BAYES LINEAR MODELS 149

A random vector X ∈ Rp follows a multivariate t–distribution, denoted

X ∼ tp (ν, μ, Σ),

if it has the density


   −(ν+p)/2
Γ ν+p 1 T −1
ν  2
1 + (X − μ) Σ (X − μ) , (6.51)
Γ 2 ν p/2 π p/2 |Σ|1/2 ν

where ν denotes the degrees of freedom, μ ∈ Rp is the location param-


eter and Σ ∈ Rp×p is the positive definite scale matrix.

Lemma 6.5 Assume (X, λ) ∼ NIG(α, β, μ, Σ). Then

β
X ∼ tp (α, μ, Σ) and λ ∼ InvGamma(α/2, β/2).
α

∞
Proof: To derive f (X), we need to integrate λ out of f (X) = f (X, λ)dλ,
0

where
 
−(p+α+2)/2 1
f (X, λ) ∝ λ exp − (Q + β) with Q = (X − μ)T Σ−1 (X − μ).

We use the substitution Q+β


2λ = x such that dλ = − Q+β
2x2 dx and get

∞
−(p+α)/2
f (X) ∝ (Q + β) x(p+α)/2−1 exp(−x)dx.
0

Recall
 ∞ a−1that, for any real a > 0, the gamma function is defined as Γ(a) =
0
x exp(−x)dx; see e.g., Mathai and Haubold (2008). Thus, the integral
in f (X) is a gamma function with a = (p + α)/2, and we can write
1
f (X) ∝ (Q + β)−(p+α)/2 ∝ (1 + (X − μ)T Σ−1 (X − μ))−(p+α)/2 . (6.52)
β

Comparing this kernel with (6.51) delivers the statement. The marginal dis-
tribution of λ is given by the hierarchical set up of NIG.
2
150 NORMAL LINEAR MODELS
For a better overview, we summarize once more all results related to the con-
jugate prior set up in Theorem 6.5.

Theorem 6.5 Given

y|β, σ 2 ∼ Nn (Xβ, σ 2 Σ) (6.53)


β|σ 2
∼ 2
Np (γ, σ Γ) (6.54)
σ 2
∼ InvGamma(a/2, b/2), (6.55)

then
b
y ∼ tn (a, m, M) (6.56)
a
β|y, σ 2 ∼ Np (γ1 , σ 2 Γ1 ) (6.57)
b1
β|y ∼ tp (a1 , γ1 , Γ1 ) (6.58)
a1
σ 2 |y ∼ InvGamma(a1 /2, b1 /2). (6.59)

with

m = Xγ (6.60)
M = Σ + XΓXT (6.61)
γ1 = Γ1 (XT Σ−1 y + Γ−1 γ) (6.62)
−1
= γ + ΓX M T
(y − Xγ) (6.63)
−1 T −1 −1
Γ1 = (Γ +X Σ X) (6.64)
= Γ − ΓXT M−1 XΓ. (6.65)
a1 = a+n (6.66)
T −1 T −1
b1 = b + (y − Xγ1 ) Σ (y − Xγ1 ) + (γ − γ1 ) Γ (γ − γ1 )(6.67)
= b + (y − Xγ1 )T M−1 (y − Xγ1 ). (6.68)
= b + yT Σ−1 y + γ T Γ−1 γ − γ1T Γ−1
1 γ1 . (6.69)

Proof: The equivalent formulas (6.62) and (6.63) are given in Theorem 6.4
and Corollary 6.2. The same is true for (6.64) and (6.65). The three equivalent
presentations of b1 are given in Theorem 6.4, Corollary 6.2 and Lemma 6.4.
The t–distribution of β in (6.58) is a consequence of Lemma 6.5. It remains
to show (6.56). From Theorem 6.1 it follows for given σ 2 that
⎛ ⎞ ⎛⎛ ⎞ ⎛ ⎞⎞
2 2 T
β γ σ Γ σ ΓX
⎝ ⎠ |σ 2 ∼ Nn+p ⎝⎝ ⎠ , ⎝ ⎠⎠ .
y m σ 2 XΓ σ2 M
BAYES LINEAR MODELS 151
This includes
y|σ 2 ∼ Nn (m, σ 2 M),
which, together with (6.55), implies

(y, σ 2 ) ∼ NIG(a, b, m, M).

Lemma 6.5 delivers (6.56).


2
Before we introduce non-informative priors, we illustrate some of the expres-
sions in Theorem 6.5 by simple examples.

Example 6.8 (Simple linear regression) We assume

yi = βxi + εi , i = 1, . . . , n, εi ∼ N(0, σ 2 ), i.i.d.

with θ = (β, σ 2 ). The prior NIG(a, b, γ, Γ) means

β|σ 2 ∼ N(γ, σ 2 Γ) and σ 2 ∼ InvGamma(a/2, b/2).

The hyperparameter γ includes a first guess for β; the variance σ 2 Γ describes


the precision of the guess. Here Γ is a scalar, we set Γ = λ. It is the prior
information about the ratio of the prior variance σ 2 to the error variance.
The hyperparameter a can be interpreted as a type of sample size on which
b
the prior information on σ 2 is based. The guess for σ 2 is given by a−2 . The
formulas (6.62), (6.64) and (6.69) become
−1

n 
n
−1
γ1 = x2i +λ xi yi + λ−1 γ
i=1 i=1
−1

n
Γ1 = λ 1 = x2i + λ−1
i=1
−1 2

n 
n 
n
−1 2 −1 −1
b1 = b + yi2 +λ γ − x2i +λ x i yi + λ γ .
i=1 i=1 i=1

In the following example we consider again the simple linear regression model,
but now we include the intercept as parameter. Also in this case the formulas
can be written explicitly.

Example 6.9 (Simple linear regression with intercept)


We assume

yi = α + βxi + εi , i = 1, . . . , n, εi ∼ N(0, σ 2 ), i.i.d.


152 NORMAL LINEAR MODELS
2
with θ = (α, β, σ ). The prior is NIG(a, b, γ, Γ). We assume that intercept and
slope have conditionally independent priors

α|σ 2 ∼ N(γa , σ 2 λ1 ) and β|σ 2 ∼ N(γb , σ 2 λ2 ),

with
σ 2 ∼ InvGamma(a/2, b/2).
Then Σ = I2 , ⎛ ⎞ ⎛ ⎞
γa λ1 0
γ=⎝ ⎠ and Γ = ⎝ ⎠.
γb 0 λ2
We apply (6.62) and (6.64). It holds that
⎛ n ⎞
−1
n + λ x
XT X + Γ−1 = ⎝ i=1 i ⎠.
1
n n 2 −1
i=1 x i i=1 x i + λ 2

Using the inversion formula


⎛ ⎞ ⎛ ⎞
a c 1 b −c
⎝ ⎠= ⎝ ⎠
c b ab − c2 −c a

and

n 
n
x2i = sxx + nx̄2 , with sxx = (xi − x̄)2
i=1 i=1

we obtain ⎛ ⎞
−1
1 ⎝sxx + nx̄2 + λ2 −nx̄
Γ1 = ⎠
d −nx̄ n + λ−1
1

with
d = (n + λ−1 −1 −1 2
1 )(sxx + λ2 ) + λ1 nx̄ . (6.70)
Further ⎛  ⎞
n −1
y + λ γ
XT y + Γ−1 γ = ⎝ ⎠
i=1 i 1 a
n −1
i=1 x i y i + λ 2 γ b

and

n 
n
xi yi = sxy + nx̄ȳ, with sxy = (xi − x̄)(yi − ȳ)
i=1 i=1

so that
1 
γ1,a = nsxx α + nλ−1 −1
2 αprior + λ1 γa + λ2
−1
d (6.71)
1
γ1,b = (n + λ−1 −1 −1
1 )(sxx β + λ2 γb ) + nx̄λ1 (ȳ − γa )
d
BAYES LINEAR MODELS 153
with

α = ȳ − β x̄
αprior = ȳ − γb x̄ (6.72)
sxy
β = . (6.73)
sxx
2

Example 6.10 (Life length vs. life line)


We continue with Example 6.3 using the centered model (6.34)

yi − ȳ = β(xi − x̄) + εi , i = 1, . . . , 48, εi ∼ N(0, σ 2 ), i.i.d.

with unknown slope β and unknown variance σ 2 , thus θ = (β, σ 2 ). We assume


the conjugate prior NIG(10, 1280, 0, 100), i.e., Eσ 2 = 160, since

β|σ 2 ∼ N(0, 100σ 2 ), σ 2 ∼ InvGamma(5, 640), , β ∼ t1 (10, 0, 12800).

We apply (6.62), (6.64), (6.69) and obtain the posteriors

β|y ∼ t1 (58, −2.257, 2.29), σ 2 |y ∼ InvGamma(29, 4096).

In Figure 6.7 the posteriors together with the prior are plotted. 2

R Code 6.2.5. Life data, Example 6.10.

y<-Agec; x<-Lengthc
# prior NIG(a,b,gamma,Gamma)
a<-10; b<-2*640
gamma<-0; Gamma<-100
## posterior NIG(a1,b1,gamma1,Gamma1)
a1<-a+n
Gamma1<-(Gamma^(-1)+sum(x^2))^(-1)
gamma1<-Gamma1*(sum(x*y)+Gamma^(-1)*gamma)
b1<-b+sum(y*y)+gamma^2*Gamma^(-1)-gamma1^2*Gamma1^(-1)
b1/a1*Gamma1

6.2.3 Jeffreys Prior


In Chapter 3 the background of Jeffreys prior as non-informative prior is
explained. We assume the model
% &
P = Nn (Xβ, σ 2 Σ) : θ = (β, σ 2 ) ∈ Rp × R+ . (6.74)
154 NORMAL LINEAR MODELS
0.015

0.15
prior
prior
posterior posterior
0.010

0.10
0.005

0.05
0.000

0.00
50 100 150 200 250 300 350 400
−10 −5 0 5
variance
slope

Figure 6.7: Example 6.10 on life length data. Left: The broken line is the prior
InvGamma(5, 640) of the error variance, and the continuous line is the posterior
InvGamma(29, 4096). Right: The broken line is the prior density t1 (10, 0, 12800) of
the slope, and the continuous line is the posterior density t1 (58, −2.26, 2.29).

Under this assumption Jeffreys prior and the reference prior coincide, see
(3.51) and the discussion in Subsection 3.3.3. In Definition 3.4 Jeffreys prior
is defined as the square root of the determinant of Fisher information.
We begin by computing Fisher information.

Theorem 6.6 The Fisher information matrix for model (6.74) is


⎛ ⎞
1 T −1
X Σ X 0
I(θ) = ⎝ σ ⎠.
2
(6.75)
n
0 2σ 4

Proof: From (3.32), the Fisher information I(θ) is given by

I(θ) = −Eθ J(θ|x),

where J(θ|x) is the matrix of the second derivatives, Hessian, of the log-
likelihood function. We have

y|θ ∼ Nn (Xβ, σ 2 Σ), (6.76)

with θ = (β, σ 2 ), so that the log-likelihood function is


n 1 1
l(θ|y) = − ln(σ 2 π) − ln(|Σ|) − 2 (y − Xβ)T Σ−1 (y − Xβ).
2 2 2σ
BAYES LINEAR MODELS 155
2
Differentiating with respect to β and σ , the gradient function is

∂l/∂θ = (∂l/∂β, ∂l/∂σ 2 )T

with
∂l 1 T −1
= X Σ (y − Xβ)
∂β σ2
∂l n 1
= − 2+ (y − Xβ)T Σ−1 (y − Xβ).
∂σ 2 2σ 2(σ 2 )2

Differentiating again,

∂2l 1 T −1
= − X Σ X
∂ββ T σ2
∂2l n 1
= − 2 3 (y − Xβ)T Σ−1 (y − Xβ)
∂(σ 2 )2 2(σ 2 )2 (σ )
∂2l 1
= − 2 2 XT Σ−1 (y − Xβ)
∂β∂σ 2 (σ )

so that
⎛ ⎞
− σ12 XT Σ−1 X − σ14 XT Σ−1 (y − Xβ)
J(θ|y) = ⎝ ⎠.
1
σ 4 (y − Xβ)T XT Σ−1 n
2σ 4 − 1
σ 6 (y − Xβ)T Σ−1 (y − Xβ)

Fisher information matrix follows as I(θ) = −EJ(θ|y). It holds that Ey = Xβ,


and the Fisher information matrix is block diagonal. Further, we calculate
 
E(y − Xβ)T Σ−1 (y − Xβ) = E tr (y − Xβ)T Σ−1 (y − Xβ)
 
= E tr (y − Xβ)(y − Xβ)T Σ−1
 
= tr E(y − Xβ)(y − Xβ)T Σ−1
= tr(Cov(y)Σ−1 ) = tr(σ 2 ΣΣ−1 ) = σ 2 tr(In ) = σ 2 n.

Thus we obtain the statement.


2
Corollary 6.3 now gives Jeffreys prior.

Corollary 6.3 For y|θ ∼ Nn (Xβ, σ 2 Σ), the Jeffreys prior of θ = (β, σ 2 )
is given as
1
π(θ) ∝ . (6.77)
(σ 2 )p/2+1
156 NORMAL LINEAR MODELS
Proof: As I(θ) is a block diagonal matrix, we get see (see Seber, 2008)
1 T −1 n 1
det(I(θ)) = det( X Σ X) · ∝ 2 p+2 , (6.78)
σ2 2(σ 2 )2 (σ )

so that the Jeffreys prior simplifies to


#
 1 1
π(θ) = det(I(θ)) ∝ ∝ 2 p/2+1 . (6.79)
(σ 2 )p+2 (σ )

Following corollary provides Jeffreys prior assuming independence between


location and scale parameter (see Example 3.24), i.e.,

π(θ) = π(β)π(σ 2 ). (6.80)

Corollary 6.4 For y|θ ∼ Nn (Xβ, σ 2 In ), the Jeffreys prior of θ = (β, σ 2 ),


assuming (6.80), is
1
π(θ) ∝ 2 .
σ

Proof: Under independence, π(θ) = π(β, σ 2 ) = π(β)π(σ 2 ), where most of


the computations are as in Theorem 6.6. Thus, we have for known σ 2
1 T −1
I(β) = X Σ X
σ2
which is independent of β, so that π(β) ∝ 1. Similarly for known β, it follows
from
n
I(σ 2 ) =
2(σ 2 )2
that
 1
π(σ 2 ) = I(σ 2 ) ∝ 2 (6.81)
σ
so that π(θ) ∝ 1/σ 2 , as needed to be proved.
2

We observe that the Jeffreys prior under (6.80) is different from that in Corol-
lary 6.3. In particular, it does not depend on p. In both cases, however, Jeffreys
prior reduces to an improper prior for θ = (β, σ 2 ). But the posterior belongs
to the conjugate family, as under the conjugate prior. This is stated in the fol-
lowing theorem. First recall that the maximum likelihood estimator in model
BAYES LINEAR MODELS 157
(6.74) is the generalized least-squares estimator

βΣ = arg minp (y − Xβ)T Σ−1 (y − Xβ)


β∈R
(6.82)
= (XT Σ−1 X)−1 XT Σ−1 y.

Theorem 6.7 Given y|θ ∼ Nn (Xβ, σ 2 Σ) with Σ known. Then, under Jef-
freys’ priors
π(β, σ 2 ) ∝ (σ 2 )−m
with 2m = p + 2 in (6.77) and m = 1 in (6.81). It follows that

(β, σ 2 )|y ∼ NIG(am , b, γ, Γ) (6.83)

and
b
β|y ∼ tp (am , βΣ , (XT Σ−1 X)−1 ) (6.84)
am
with
am = 2m + n − p − 2
b = (y − XβΣ )T Σ−1 (y − XβΣ )
(6.85)
γ = (XT Σ−1 X)−1 XT Σ−1 y = βΣ
Γ = (XT Σ−1 X)−1 .

Proof: We calculate

π(β, σ 2 |y) ∝ π(β, σ 2 )(β, σ 2 |y)


 
1 1 T −1
∝ exp − (y − Xβ) Σ (y − Xβ) .
(σ 2 )n/2+m 2σ 2

Since

βΣ = (XT Σ−1 X)−1 XT Σ−1 (Xβ + )


= β + (XT Σ−1 X)−1 XT Σ−1 ,

we have

y − XβΣ = + X(β − βΣ )
= (In − X(XT Σ−1 X)−1 XT Σ−1 ) .
158 NORMAL LINEAR MODELS
The projection property of the generalized least-squares estimator implies that
the mixed term is zero:

(y − XβΣ )T Σ−1 X(β − βΣ )


 
= T In − Σ−1 X(XT Σ−1 X)−1 XT Σ−1 X(XT Σ−1 X)−1 XT Σ−1
=0

Thus the following decomposition holds.

(y − Xβ)T Σ−1 (y − Xβ)


= (βΣ − β)T XT Σ−1 X(βΣ − β) + (y − XβΣ )T Σ−1 (y − XβΣ ) (6.86)
= (βΣ − β)T XT Σ−1 X(βΣ − β) + b.

Using (6.86) we obtain


 
1
π(β, σ 2 |y) ∝ (σ 2 )−(m+n/2) exp − 2 ((βΣ − β)T XT Σ−1 X(βΣ − β) + b) ,

which implies (6.83). Applying Lemma 6.5, we obtain (6.84).


2

For Σ = In , the posterior coincides with the least–squares solution, as stated


in the following corollary.

Corollary 6.5 Given y|θ ∼ Nn (Xβ, σ 2 In ). Then, under Jeffreys prior

β|y, σ 2 ∼ Np (β, σ 2 (XT X)−1 ),

with β = (XT X)−1 XT y.

Example 6.11 (Life length vs. life line)


We continue with the centered model (6.34) in Examples 6.4 and 6.10, now
using Jeffreys prior (6.81). Recall

1 
n
sxy
Σ = In , β = , XT Σ−1 X = sxx , s2 = (yi − ȳ − β(xi − x̄))2 .
sxx n − 1 i=1

We apply the posteriors from Theorem 6.7, i.e.,

β|y ∼ t1 (n − 1, β, s2 /sxx ) and σ 2 |y ∼ InvGamma((n − 1)/2, (n − 1)s2 /2).


LINEAR MIXED MODELS 159
0.012

0.15
0.008

0.10
0.004

0.05
0.000

0.00
50 100 150 200 250 300 −10 −5 0 5

variance slope

Figure 6.8: Example 6.11. Left: Posterior distribution InvGamma(23.5, 3456) of σ 2 .


Right: Posterior distribution t1 (47, −2.26, 2.38) of β with mean β shown as vertical
line.

In Example 6.3 we calculated β = −2.26, sxx = 61.65 and s2 = 147, 07. Figure
6.8 depicts the posterior distributions.
2

Example 6.12 (Corn plants)


Example 6.5 analyzes corn data (Example 6.1) using conjugate priors. Here,
we re-analyze it for Jeffreys prior.
With y ∼ N(Xβ, σ 2 I), keeping σ 2 = 300 as before, the posterior distribu-
tion for β = (β0 , β1 , β2 , β3 )T under Jeffreys prior is computed as π(β|y, σ 2 ) ∼
N4 (β, σ 2 (XT X)−1 ), where β and (XT X)−1 are given in Example 6.5. The
posterior distributions for the three betas are shown in Figure 6.9, where ver-
tical lines at the center of each curve are the mean values, βj , j = 1, 2, 3. 2

6.3 Linear Mixed Models


In this section, we extend the general linear model, (6.2), to linear mixed
model. For general theory of linear mixed models and their applications, see
e.g., Demidenko (2013) and Searle et al. (2006).
Unlike model (6.2) a linear mixed model includes an additional random com-
ponent besides the error variable. We consider the linear mixed model as
defined in Demidenko (2013, Chapter 2):

y = Xβ + Zγ + , ∼ Nn (0, σ 2 Σ), γ ∼ Nq (0, σ 2 Δ), (6.87)


160 NORMAL LINEAR MODELS

π(β1 | y)
π(β2 | y)
π(β3 | y)

π(β)

Figure 6.9: Example 6.12. Prior and posterior distributions of β1 , β2 , β3 for corn data
with Jeffreys prior.

where y ∈ Rn is the observed vector of response variables, X ∈ Rn×p and


Z ∈ Rn×q are the known design matrices for fixed and random parts of the
model. The vector of unobserved random components γ ∈ Rq is a latent vari-
able, is the vector of unobserved errors, where we assume that and γ are
independent. The matrices Σ  0 and Δ  0 are known. The unknown pa-
rameter θ = (β, σ 2 ) consists of the vector of regression parameters β ∈ Rp
and the variance parameter σ 2 > 0. We assume that X and Z are full rank
matrices, i.e., r(X) = p and r(Z) = q, where n > p and n > q.

Regarding notation, it may be emphasized that we use γ in (6.87) for random


component, following the literature. In the literature, γ is often called random
parameter and its distribution a prior. In our textbook we want to clarify that
γ and its distribution are part of the statistical model and are not explained
by a Bayes model. The Bayes model assumes priors on the parameter θ.
Often, Model (6.87) is more convenient at observation level, i.e.,

yi = xTi β + zTi γ + i , (6.88)

with xi ∈ Rp and zi ∈ Rq , i = 1, . . . , n. We illustrate it by the following


example.

Example 6.13 (Hip operation)


We use a data set discussed in Crowder and Hand (1990, Example 5.4, p.78).
The data consist of 4 repeated measurements taken on each of 30 hip opera-
tion patients, one before operation and three after it, with age recorded as a
LINEAR MIXED MODELS 161
covariate. The measurements are haematocrit (volume percentage of red blood
cells measured in a blood test) as an indicator of patients’ health. For male
patients, values below 40.7 may indicate anemia, values above 50.3 are also
a bad sign. The patients consist of two independent groups, 13 males and 17
females. For our purposes, we use male group data, removing three discordant
observations and the corresponding columns. Hence, the analyzed data are of
n = 10 patients at two time points. We consider age as fixed factor, and, for
illustrative purposes, treat the two repeated observations as levels of random
(time) factor (q = 2),

yij = β0 + xi β1 + γj + ij , i = 1, . . . , 10, j = 1, 2.

Collecting the observations in one column as

y = (y11 , y12 , y21 , . . . , y10,2 )T , x = (x1 , x1 , x2 , x2 , . . . , x10 )

and setting β = (β0 , β1 )T ,

γ = (γ1 , γ2 )T , Z = (I2 , . . . , I2 )T

the model for this data can be stated as

y = Xβ + Zγ + , ∼ N20 (0, σ 2 Σ), γ ∼ N2 (0, σ 2 Δ)

where X ∈ R20×2 (with first column of 1s for intercept), Z ∈ R20×2 and


∈ R20 , hence, y ∈ R20 . The data are illustrated in Figure 6.10. 2

In model (6.87), we have two main tasks: estimation of θ and prediction of γ.


Before we start with the Bayesian approach we present frequentist methods
for estimation of β and prediction of γ.

Estimation of β
First we are only interested in estimating the regression parameter β. We
eliminate the random component γ in the model. Setting

ξ = Zγ + , and V = ZΔZT + Σ

we obtain a univariate linear model like (6.2)

y = Xβ + ξ, with ξ ∼ Nn (0, σ 2 V). (6.89)

The model (6.89) is called marginal model . The maximum likelihood estimator
of β follows as

βV = (XT V−1 X)−1 XT V−1 y. (6.90)


162 NORMAL LINEAR MODELS

50
● first visit
● second visit
45 ● ●


● ●

Haematocrit

40




35
30

64 66 68 70 72 74 76

Age

Figure 6.10: Illustration for the data set used in Example 6.13. Values under the
broken line indicate anemia. But the blood values are improved for most of patients.
Note that, the three patients with same age 70, are plotted as with age 69.5, 70, 70.5.

Prediction of γ
First we re-formulate the model (6.87) in a hierarchical setting

y|γ, θ ∼ Nn (Xβ + Zγ, σ 2 Σ)


(6.91)
γ|θ ∼ Nq (0, σ 2 Δ).

In the literature, this model formulation is called two-level hierarchical model.


We consider the joint probability function as function of (β, γ):

p(y, γ|θ) = p(y|γ, θ)p(γ|θ)


(6.92)
=: h(β, γ)

It holds that
 
1  T −1 T −1

h(β, γ) ∝ exp − 2 (y − Xβ − Zγ) Σ (y − Xβ − Zγ) + γ Δ γ .

Maximizing h(β, γ) with respect to (β, γ) delivers the least-squares minimiza-


tion problem
 
min (y − Xβ − Zγ)T Σ−1 (y − Xβ − Zγ) + γ T Δ−1 γ .
β,γ
LINEAR MIXED MODELS 163
Putting the first derivatives to zero we obtain the normal equation system:

XT Σ−1 Xβ + XT Σ−1 Zγ = XT Σ−1 y (6.93)


−1
T
Z Σ Xβ + (ZT Σ−1 Z + Δ−1 )γ = ZT Σ−1 y (6.94)

The second equation (6.94) gives

γ = (ZT Σ−1 Z + Δ−1 )−1 ZT Σ−1 (y − Xβ).

Setting this in (6.93) delivers

XT WXT β = XT Wy

with
W = Σ−1 − Σ−1 ZT (ZT Σ−1 Z + Δ−1 )−1 ZT Σ−1 .
Applying Lemma 6.3 we obtain

W = (ZΔZT + Σ)−1 = V−1 .

Thus the solution of the equation system for β coincides with βV in (6.90)
and the prediction of γ is given by

γ = (ZT Σ−1 Z + Δ−1 )−1 ZT Σ−1 (y − XβV ), (6.95)

which can also be expressed as

γ = ΔZT V−1 (y − XβV ). (6.96)

It follows from

(ZT Σ−1 Z + Δ−1 )−1 ZT Σ−1 = ΔZT V−1 ,

with ZΔZT + Σ = V, that

(ZT Σ−1 Z + Δ−1 )ΔZT V−1 = ZT Σ−1 (ZΔZT + Σ)V−1 = ZT Σ−1 .

Further the prediction (6.96) is a plug-in method. An application of Lemma


6.2 to (6.91) implies
⎛ ⎞ ⎛⎛ ⎞ ⎛ ⎞⎞
y Xβ V ZΔ
⎝ ⎠ |θ ∼ Nn+q ⎝⎝ ⎠ , σ 2 ⎝ ⎠⎠ . (6.97)
γ 0 ΔZT Δ

Note that, the joint distribution (6.97) explains the name marginal model for
(6.89). Lemma 6.1 implies

E(γ|y, β) = ΔZT V−1 (y − Xβ).


164 NORMAL LINEAR MODELS
Thus the prediction (6.96) coincides with the conditional expectation, with
βV replacing β.

6.3.1 Bayes Linear Mixed Model, Marginal Model


In this section we consider the marginal model,
% &
P = N(Xβ, σ 2 V) : V = Σ + ZΔZT , θ = (β, σ 2 ) ∈ Rp × R+ ,

and derive the posterior distributions π(θ|y) under conjugate and non-
informative priors. We start with the conjugate prior πc , i.e.,

θ ∼ NIG(a, b, β0 , Γ). (6.98)

The Bayes model {P, πc } is a univariate linear model with conjugate prior.
We can apply Theorem 6.5 and obtain:

Theorem 6.8 For the marginal model set up {P, πc }, it holds that

θ|y ∼ NIG( a1 , b1 , β1 , Γ1 ), (6.99)

and
b1
β|y ∼ tp (a1 , β1 , Γ1 ), (6.100)
a1
with

a1 = a+n (6.101)
−1
b1 = b + (y − Xβ1 ) M T
(y − Xβ1 ) (6.102)
β1 = β0 + ΓXT M−1 (y − Xβ0 ) (6.103)
−1
Γ1 = Γ − ΓX M T
XΓ (6.104)
M = V + XΓXT (6.105)

Now using the Jeffreys prior πm

πm (θ) ∝ (σ 2 )−m , (6.106)

where m = 1 gives the prior, under the condition that location and scale
parameter are independent. Otherwise, 2m = p + 2 without this condition;
see Section 6.2.3. The Bayes model {P, πm } is a univariate linear model with
non-informative prior. We apply Theorem 6.7 and obtain:
LINEAR MIXED MODELS 165

Theorem 6.9 For the marginal model set up {P, πm } it holds that

θ|y ∼ NIG(am , b, βV , XT V−1 X), (6.107)

and
b T −1
β|y ∼ tp (am , βV , X V X), (6.108)
am
where
βV = (XT V−1 X)−1 V−1 Xy

am = 2m + n − p − 2 (6.109)
−1
b = (y − XβV ) VT
(y − XβV ). (6.110)

6.3.2 Bayes Hierarchical Linear Mixed Model


Here, we consider linear mixed models (6.87) including the latent variable γ
and its distribution. A Bayes model assumes a prior distribution on the pa-
rameter θ = (β, σ 2 ). Our aim is to derive posterior π(θ|y) for conjugate priors.
Furthermore we are interested in the predictive distribution p(γ|y).

We assume
θ ∼ NIG(a, b, β0 , Γ). (6.111)
The entire model set up can be formulated as

y|β, σ 2 , γ ∼ Nn (Xβ + Zγ, σ 2 Σ) (6.112)


β|σ 2
∼ 2
Np (β0 , σ Γ) (6.113)
γ|σ 2 ∼ Nq (0, σ 2 Δ) (6.114)
σ 2
∼ InvGamma(a/2, b/2) (6.115)

where β|σ 2 , and γ|σ 2 are assumed to be mutually independent. We see that
under Bayes model set up the random component γ and the parameter θ have
equivalent status. It also explains why often γ is called random parameter and
its distribution prior. Set the joint parameter as α = (β, γ), α0 = (β0 , 0) and
U = (X Z), Ψ = diag(Γ Δ). Then the model set up can also be written as

y|α, σ 2 ∼ Nn (Uα, σ 2 Σ)
(6.116)
(α, σ 2 ) ∼ NIG(a, b, α0 , Ψ).
166 NORMAL LINEAR MODELS
Following theorem is a consequence of Theorem 6.5, as it sums up the poste-
rior under (6.116).

Theorem 6.10 For the model set up in (6.116), it holds that

(α, σ 2 )|y ∼ NIG(a1 , b1 , α1 , Ψ1 ), (6.117)

and
b1
α|y ∼ tp+q (a1 , α1 , Ψ1 ), (6.118)
a1
with

a1 = a+n (6.119)
−1
b1 = b + (y − Uα1 ) M T
(y − Uα1 ) (6.120)
= b + y Σ y + α0T Ψ−1 α0 −
T −1
α1T Ψ−1
1 α1 (6.121)
α1 = α0 + ΨUT M−1 (y − Uα0 ) (6.122)
= Ψ1 (UT Σ−1 y + Ψ−1 α0 ) (6.123)
−1
Ψ1 = Ψ − ΨU M UΨ
T
(6.124)
= (Ψ−1 + UT Σ−1 U)−1 (6.125)
M = Σ + UΨUT (6.126)

where the component for β|y in α|y is the posterior distribution of β and
the component for γ|y is the predictive distribution for γ.

Note that the formulas for the posterior parameters can be presented in dif-
ferent ways as stated in Theorem 6.5. The presentation above is helpful for a
separation between β and γ. Especially in (6.120) it holds for α1T = (β1T , γ1T )
that
b1 = b + (y − Xβ1 − Zγ1 )T M−1 (y − Xβ1 − Zγ1 )
and in (6.124)
⎛ ⎞
Γ − ΓXT M−1 XΓ −ΓXT M−1 ZΔ
Ψ1 = ⎝ ⎠.
−ΔZT M−1 XΓ Δ − ΔZT M−1 ZΔ
Using the presentation (6.125) we obtain
⎛ ⎞
T −1 T −1
Γ + X Σ X X Σ Z
Ψ−1
1 =
⎝ ⎠. (6.127)
ZT Σ−1 X Δ + ZT Σ−1 Z
Further
M = Σ + XΓXT + ZΔZT = V + XΓXT
LINEAR MIXED MODELS 167
−1
β1 = β0 + ΓX M T
(y − Xβ0 )
and
γ1 = ΔZT M−1 y.
We see that β1 included in (6.122) coincides with β1 in (6.103).

The practically most commonly used special case of Theorem 6.10 is when
all covariance matrices are spherical and the design matrices are orthogonal.
Furthermore we consider σ 2 as known. We state the result as a corollary to
Theorem 6.10.

Corollary 6.6 Given the model set up in (6.116), with Γ = rβ Ip , Δ = rγ Iq


and Σ = In , where ZT X = O, then

β|y, σ 2 ∼ Np (μβ|y , σ 2 Σβ|y )


(6.128)
γ|y, σ 2 ∼ Nq (μγ|y , σ 2 Σγ|y )

Which are conditionally independent, with

μβ|y = (rβ−1 Ip + XT X)−1 (XT y + rβ−1 β0 )


Σβ|y = (rβ−1 Ip + XT X)−1
(6.129)
μγ|y = (rγ−1 Iq + ZT Z)−1 ZT y
Σγ|y = (rγ−1 Iq + ZT Z)−1 .

Proof: First, for the given set up, we have


⎛ ⎞
−1
r I 0
Ψ−1 = ⎝ β ⎠.
p

0 rγ−1 Iq

Using U = (X Z) and the orthogonality of X and Z


⎛ ⎞
T
X X 0
UT Σ−1 U = ⎝ ⎠.
0 ZT Z

Thus from (6.125) it follows that


⎛ ⎞
(r−1 Ip + XT X)−1 O
Ψ1 = ⎝ β ⎠. (6.130)
O (rγ−1 Iq + ZT Z)−1
168 NORMAL LINEAR MODELS
The block diagonal structure implies the independence of the posteriors given
the parameter σ 2 . Further we apply (6.123) and calculate
⎛ ⎞
β1
α1 = ⎝ ⎠ = Ψ1 (UT Σ−1 y + Ψ−1 α0 )
γ1

with ⎛ ⎞
XT y + rβ−1 β0
UT Σ−1 y + Ψ−1 α0 = ⎝ ⎠
ZT y
and obtain μβ|y and μγ|y .
2
For a better understanding of the formulas above we discuss a simple special
case.

Example 6.14 (Simple linear mixed model)


Suppose the following data generating equation:

yi = xi β + γ + εi , εi ∼ N(0, σ 2 ), γ ∼ N(0, σ 2 λ), i = 1, . . . , n.

The parameter of interest is θ = (β, σ 2 ). The design points xi are known


and centered, x̄ = 0. The random variables γ and ε1 , . . . , εn are mutually
independent and unobserved. The prior π(θ) is NIG(a, b, β0 , τ ). Because x̄ = 0
and z = 1 we have XT Z = 0, and we apply Corollary 6.6 to obtain

β|y, σ 2 ∼ N(β1 , σ 2 τ1 )

with
β1 = τ1 (sxy + τ −1 β0 )
τ1 = (sxx + τ −1 )−1

and
γ|y, σ 2 ∼ N((n + λ−1 )−1 nȳ, σ 2 (n + λ−1 )−1 ).
From Theorem 6.10 with (6.121), it follows that

σ 2 |y ∼ InvGamma((a + n)/2, b1 /2)

with

b1 = b + ny 2 + τ −1 β02 − (sxx + τ −1 )−1 (sxy + τ −1 β0 )2 − (n + λ−1 )−1 (nȳ)2 ,


n
where ny 2 = i=1 yi2 . Summarizing it holds that

(β, σ 2 )|y ∼ NIG(a + n, b1 , β1 , τ1 )


LINEAR MIXED MODELS 169
and Lemma 6.5 gives
b1
β|y ∼ t1 (a + n, β1 , τ1 ).
a+n
2

Example 6.15 (Hip operation)


We continue with Example 6.13 and apply Theorem 6.10. Using conjugate
prior NIG(a, b, α0 , I4 ) with a = 6, b = 40 and α0 = 0, we obtain a1 = 26,
b1 = 451.6 and
⎛ ⎞ ⎛ ⎞
2.62 0.9668 −0.0138 −0.0166 0.0166
⎜ ⎟ ⎜ ⎟
⎜ ⎟ ⎜ ⎟
⎜ 0.50 ⎟ ⎜−0.0138 0.0003 −0.0069 −0.0069⎟
α1 = ⎜

⎟ Ψ1 = ⎜
⎟ ⎜
⎟.

⎜ 4.86 ⎟ ⎜−0.0166 −0.0069 0.5371 0.4462 ⎟
⎝ ⎠ ⎝ ⎠
−2.23 −0.0166 −0.0069 0.4462 0.5371

The distributions of β|y and γ|y are


⎛ ⎛ ⎞ ⎛ ⎞⎞
2.62 16.793 −0.240
β|y ∼ t2 ⎝26, ⎝ ⎠,⎝ ⎠⎠
0.50 −0.240 0.005

and ⎛ ⎛ ⎞ ⎛ ⎞⎞
4.86 9.330 7.751
γ|y ∼ t2 ⎝26, ⎝ ⎠,⎝ ⎠⎠ .
−2.23 7.751 9.330
2

R Code 6.3.6. Hip data, Example 6.15.

# data
Age<-c(66,66,70,70,70,70,74,74,65,65,71,71,68,68,69,69,
64,64,70,70)
Haematocrit<-c(47.1,32.8,44.1,37,43.3,36.6,37.4,29.05,45.7,
39.8,46.05,37.8,42.1,36.05,38.25,30.5,43,36.65,37.8,30.6)
# joint design matrix
U<-matrix(rep(0,4*20),ncol=4)
U[,1]<-rep(1,20)
U[,2]<-Age
U[,3]<-c(1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0)
U[,4]<-c(0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1)
y<-Haematocrit
# prior: NIG(a0,b0,alpha0,Psi)
a0<-3*2
170 NORMAL LINEAR MODELS
b0<-10*4
b0/(a0-2) # prior expectation variance
(b0/2)^2/((a0/2-1)^2*(a0/2-2))# prior variance of variance
Psi<-diag(4)
alpha0<-c(0,0,0,0)
# posterior: NIG(a1,b1,alpha1,Psi1)
UU<-t(U)%*%U
Psi1<-solve(solve(Psi)+UU)
alpha1<-Psi1%*%(t(U)%*%y+solve(Psi)%*%alpha0)
a1<-a0+length(Age)
b1<-b0+y%*%y+alpha0%*%solve(Psi)%*%alpha0
-alpha1%*%solve(Psi1)%*%alpha1
# t-distribution
s<-as.numeric(b1/a1)
s*Psi1

6.4 Multivariate Linear Models


Here, we extend the univariate general linear model in (6.2) to the multivari-
ate case, i.e., when a vector of responses is observed on each unit. We give
the results and the main arguments; for technical details we refer the reader
to Zellner (1971, Chapter 8) and Box and Tiao (1973, Chapter 8).

We define the multivariate general linear model as

Y = XB + E, (6.131)

where Y = (yij ) = (y1T , . . . , ynT )T ∈ Rn×d is the n × d matrix of re-


sponses, E = ( ij ) = ( T1 , . . . , Tn )T is the n × d matrix of random errors,
B = (β1T , . . . , βpT )T ∈ Rp×d is the p × d matrix of unknown parameters, and
X = (xT1 , . . . , xTn )T ∈ Rn×p is the n × p design matrix, assumed to be of full
rank, i.e., r(X) = p. The row vectors of the response matrix in (6.131),

y i = BT x i + i , i ∼ Nd (0, Σ) i = 1, . . . , n (6.132)

are i.i.d. This implies that the elements of the response matrix are normally
distributed, where the elements in different columns are correlated, while ele-
ments in different rows are independent.

In order to describe this distribution in a closed form, we use the notion of a


matrix-variate normal distribution, (Gupta and Nagar, 2000, Chapter 2).
MULTIVARIATE LINEAR MODELS 171

Let Z = (zij ) be an m × k random matrix. Using the vec operator,


vec(Z) = (z11 , . . . , zm1 , z12 , . . . , zmk )T , we assume

vec(Z) ∼ Nmk (vec(M), V ⊗ U)

where M ia an m×k location matrix, ⊗ denotes the Kronecker product


and V is k×k positive definite scale matrix for the columns, U is m×m
positive definite scale matrix for the rows. We write

Z ∼ MNm,k (M, U, V). (6.133)

For θ = (M, U, V), the density f (Z|θ) is proportional to


 
−m −k 1  −1 T −1

|V| 2 |U| exp − tr V (Z − M) U (Z − M) .
2 (6.134)
2

This implies that, E ∼ MNn,d (O, In , Σ), where In is the identity matrix.
Correspondingly, then,

Y ∼ MNn,d (XB, In , Σ),

with density
 
−nd/2 −n/2 1  −1 
f (Y|B, Σ) = (2π) |Σ| exp − tr Σ (Y − XB) (Y − XB) . T
2
(6.135)
The multivariate model (6.131) can be considered columnwise as a collection
of d correlated univariate models. Let y(s) , β(s) , and (s) be the sth columns of
the response matrix Y, parameter matrix B, and error matrix E, respectively.
Then

y(s) = Xβ(s) + (s) , (s) ∼ Nn (0, σs2 In ), s = 1, . . . , d (6.136)

with Ey(s) = Xβ(s) , so that EY = XB = (Xβ(1) , . . . , Xβ(d) ). Note, how-


ever, that, (s) are correlated, so that Cov( (s) ) = σs2 In = Cov(y(s) ) and
Cov( (s) , (l) ) = σsl In .

Under this setting, the model for Y ∈ Rn×d can be written as


% &
P = MNn,d (XB, In , Σ) : B ∈ Rp×d , Σ ∈ Rd×d , Σ  0 . (6.137)
172 NORMAL LINEAR MODELS
For a better understanding of multivariate model, we go back to the corn data
example.

Example 6.16 (Corn plants) We analyze complete corn data, using both
dependent variables; see Example 6.1 on page 127. The three independent
variables are the same as used in Examples 6.5 and 6.12, where the analysis
with one dependent variable was discussed.
With n = 17 independent vectors, where d = 2 and p = 4, we have Y and X
as 17 × 2 and 17 × 4 matrices, respectively. 2

We continue with analysis of the statistical model (6.137) with unknown pa-
rameter θ = (B, Σ). Writing (6.135) as likelihood function the maximum like-
lihood estimation problem
 
n 1  
arg max − ln |Σ| − tr Σ−1 (Y − XB)T (Y − XB) ,
B,Σ 2 2

yields the MLEs of B and Σ, respectively, as

B = (XT X)−1 XT Y (6.138)


1
Σ = (Y − XB)T (Y − XB), (6.139)
n
(see, e.g., Mardia et al., 1979, Chapter 6). The fact that each univariate model
in (6.136) has the same design matrix X implies Y = XB = PY, with
P = X(XT X)−1 XT the same projection matrix as for the univariate case.
Thus, we can write Y = (Xβ1 , . . . , Xβd ) as a collection of d fitted univariate
models, with the exception that they are possibly correlated. This also allows
the similar orthogonal partition of the model matrices

(Y − XB)T (Y − XB) = S + (B − B)T XT X(B − B) (6.140)

with S = (Y − XB)T (Y − XB) = nΣ.

Finally, for B as linear function of Y, the normality assumption implies

B ∼ MNp,d (B, (XT X)−1 , Σ)

and for nΣ we have the Wishart distribution,

nΣ ∼ Wd (n − p, Σ).

We hereby recall a general definition of the Wishart distribution.


BAYES MULTIVARIATE LINEAR MODELS 173

The k × k positive definite random matrix, W, follows the Wishart


distribution
W ∼ Wk (ν, V),
with ν > k − 1 degrees of freedom and k × k positive definite scale
matrix V, iff its density is given by
 
− ν2 ν−k−1 1  −1

f (W|V, ν) = cW |V| |W| 2 exp − tr WV , (6.141)
2

where ⎛ ⎞−1

k
ν+1−j ⎠
c W = ⎝2
kν k(k−1)
2 π 4 Γ( ) . (6.142)
j=1
2

6.5 Bayes Multivariate Linear Models


In the following, we discuss Bayes models {P, π} with P given in (6.137). We
assume conjugate prior and Jeffreys prior to derive the posteriors over
% &
Θ = (B, Σ) : B ∈ Rp×d , Σ ∈ Rd×d , Σ positive definite . (6.143)

6.5.1 Conjugate Prior


To begin with conjugate prior, we introduce the conjugate family of normal–
inverse-Wishart distributions. First we introduce the inverse-Wishart distri-
bution IWd (ν, V) which is a multivariate extension of the inverse-gamma dis-
tribution.

The k×k positive definite random matrix, W, has the inverse-Wishart


distribution
W ∼ IWk (ν, V),
with degrees of freedom ν > k − 1 and k × k positive definite scale
matrix V, iff
W−1 ∼ Wk (ν, V−1 ).
Its density is
 
ν+k+1 1  
f (W|V, ν) = cIW |V| 2 |W|− exp − tr W−1 V .
ν
2 (6.144)
2

where cIW = cW given in (6.142).


174 NORMAL LINEAR MODELS
Set k = 1, V = λ, then

IW1 (ν, λ) ≡ InvGamma(ν/2, λ/2).

The normal-inverse-Wishart distribution NIW(ν, M, U, V) is then hierarchi-


cally defined as following.

Let Z be a m × k random matrix and let W be k × k positive definite


random matrix. The random matrices (Z, W) have a joint distribution
belonging to the family of normal-inverse-Wishart distributions

(Z, W) ∼ NIW(ν, M, U, V)

iff
Z|W ∼ MNm,k (M, U, W)
W ∼ IWk (ν, V)

The joint density of (Z, W) is proportional to the kernel k(Z, W|θ),


that is
 
− ν+k+m+1 1 −1 T −1
|W| 2 exp − tr W V + (Z − M) U (Z − M) ,
2
(6.145)
with the parameters, ν as degrees of freedom, M as m × k location
matrix, V as k × k positive definite scale matrix, and U as m × m
positive definite scale matrix. Thus the joint density is f (Z, W|θ) =
cNIW k(Z, W|θ) with
⎛ ⎞−1

k
ν+1−j ⎠
cNIW = ⎝2mk π
k(k−1)+2mk
|V| 2 |U|− 2 . (6.146)
ν k
4 Γ( )
j=1
2

For k = 1, M = m and V = λ, it holds that

NIW(ν, m, U, λ) ≡ NIG(ν, λ, m, U).

There exists a general result on marginal distributions of NIW distributions.


Before stating it we introduce the matrix-variate t-distribution. .
BAYES MULTIVARIATE LINEAR MODELS 175

Let Z be a m×k random matrix. Z has a matrix-variate t-distribution

Z ∼ tm,k (ν, M, U, V)

iff the density f (Z|θ) is proportional to


ν+m+k−1
|V|− 2 |U|− 2 |Ik + V−1 (Z − M)T U−1 (Z − M)|−
m k
2 (6.147)

with the parameters, ν as degrees of freedom, M as m × k location


matrix, V as k × k positive definite scale matrix, and U as m × m
positive definite scale matrix.

Note that the parametrization is different from the multivariate t-distribution.


For k = 1, V = λ, and M = m, we have
λ
tm,1 (ν, m, U, λ) ≡ tm (ν, m, U). (6.148)
ν
We have

Lemma 6.6 Assume

(Z, W) ∼ NIW(ν, M, U, V).

Then
Z ∼ tm,k (ν − k + 1, M, U, V).

Proof: The joint density of (Z, W) is proportional to


 
−k ν
− ν+k+m+1 1  −1 
|U| |V| |W|
2 2 2 exp − tr W A (6.149)
2
with
A = V + (Z − M) U−1 (Z − M) .
T

Note that
 
− ν+m+k+1 ν+m 1  
K(W) ∝ |W| 2 |A| 2 exp − tr W−1 A
2
is kernel of IW(ν + m, A). We have
ν+m
f (Z, W) ∝ |U|− 2 |V| 2 |A|−
k ν
2 K(W).
Thus we can integrate out W and obtain
ν+m
f (Z) ∝ |U|− 2 |V| 2 |A|−
k ν
2

ν+m
∝ |U|− 2 |V| 2 |V + (Z − M) U−1 (Z − M) |−
k ν T 2

ν+m
∝ |U|− 2 |V|− 2 |Ik + V−1 (Z − M) U−1 (Z − M) |−
k m T 2 .
176 NORMAL LINEAR MODELS
This is the kernel of tm,k (ν − k + 1, U, V).
2
We set the conjugate prior of θ = (B, Σ) as
(B, Σ) ∼ NIW(ν0 , B0 , C0 , Σ0 ), (6.150)
where B0 is first guess for the matrix of regression parameters B, C0 says
something about the accuracy of the guess, Σ0 includes subjective information
on the correlation between different univariate models, and ν0 measures how
many observations the guesses are based on. The density of normal-inverse-
Wishart (NIW) distribution, NIW(ν0 , B0 , C0 , Σ0 ), is:
 
ν0 +p+d+1 1  
π(B, Σ) ∝ |Σ0 |− 2 exp − tr Σ−1 (Σ0 + (B − B0 )T C−10 (B − B 0 ))
2
(6.151)
The entire Bayes model setup can be stated as

Y|θ ∼ MNn,d (XB, In , Σ)


B|Σ ∼ MNp,d (B0 , C0 , Σ) (6.152)
Σ ∼ IWd (ν0 , Σ0 ).

This leads to a normal-inverse-Wishart posterior, stated in the following the-


orem.

Theorem 6.11 Given (B, Σ) ∼ NIW(ν0 , B0 , C0 , Σ0 ), then

(B, Σ)|Y ∼ NIW(ν1 , B1 , C1 , Σ1 ) and B|Y ∼ tp,d (ν1 − d + 1, B1 , C1 , Σ1 )

with ν1 = ν0 + n and

B1 = C1 (XT Y + C−1
0 B0 )
 −1  −1
C1 = C0 + XT X
Σ1 = Σ0 + (Y − XB1 )T (Y − XB1 ) + (B0 − B1 )T C−1
0 (B0 − B1 ).

Proof: Using π(θ|x) ∝ π(θ)(θ|x), where now the observations x are denoted
by Y generated by the multivariate linear model in (6.131), we obtain
 
ν +p+d+1+n
− 0 1  −1 
π(θ|Y) ∝ |Σ| 2 exp − tr Σ term1
2
with
term1 = Σ0 + (Y − XB)T (Y − XB) + (B − B0 )T C−1
0 (B − B0 ).
BAYES MULTIVARIATE LINEAR MODELS 177
Completing the squares for B gives
term1 = Σ1 + (B − B1 )T C−1
1 (B − B1 ),

where
Σ1 = Σ0 + (Y − XB1 )T (Y − XB1 ) + (B0 − B1 )T C−1
0 (B0 − B1 )

with
C1 = (C−1 T
0 + X X)
−1
and B1 = C1 (XT XB + C−1
0 B0 ).

Summarizing, we obtain the posterior distribution proportional to


 
ν0 +n+d+p+1 1  T −1
 −1 
|Σ|− 2 exp − tr Σ1 + (B − B1 ) C1 (B − B1 ) Σ ,
2
which is the kernel of the normal-inverse-Wishart distribution in the state-
ment. The marginal posterior follows from Lemma 6.6.
2

6.5.2 Jeffreys Prior


Assuming prior independence of B and Σ, the joint prior follows as
πJeff (B, Σ) ∝ π(B)π(Σ).
Consider first π(B). The matrix XB is the location parameter in (6.135).
We apply the result on location models in Example 3.21 on vec(Y) ∼
Nnd (vec(XB), In ⊗ Σ) and obtain
π(B) ∝ const.
The prior π(Σ) is now related to model (6.132), where we set B = 0. Thus we
have to calculate the Fisher information of the model
y1 , . . . , yn i.i.d. Nd (0, Σ). (6.153)
Recall the calculation of Fisher information in Section 6.2.3. It includes the
calculation of the first derivative of the log-likelihood function with respect to
the unknown parameters. Here in model (6.153) the unknown parameter is the
symmetric matrix Σ = (σij ), with σij = σji , so that we have m = d(d + 1)/2
different parameters. But the vectorization by vec operator includes all d2
elements of the matrix,
vec(Σ) = (σ11 , . . . σd1 , σ12 , . . . , σd2 , σ13 , . . . , σdd )T .
The way out is to use a vectorization of a symmetric matrix with the operator
vech which stacks the columns by starting each column at its diagonal element,
and includes m elements,
vech(Σ) = (σ11 , . . . σd1 , σ22 , . . . , σd2 , σ33 , . . . , σdd )T .
178 NORMAL LINEAR MODELS
It holds that
vec(Σ) = D vech(Σ), (6.154)
where D is the d × m duplication matrix. Using this notation in Magnus and
2

Neudecker (1999), where a calculus for matrix differentiation is developed, the


Fisher information in model (6.153) is given by the m × m matrix (see also
Magnus and Neudecker, 1980)
I(Σ) = DT (Σ−1 ⊗ Σ−1 )D. (6.155)
Further, the determinant of Fisher information is calculated as in (see Magnus
and Neudecker, 1980)
|I(Σ)| = 2m |Σ|−(d+1) .
We obtain d+1
π(Σ) ∝ |I(Σ)| 2 ∝ |Σ|−
1
2 .
Finally
πJeff (B, Σ) ∝ |Σ|−(d+1)/2 . (6.156)
The following theorem gives the posterior and the marginal posterior.

Theorem 6.12 For the Bayes model {P, πJeff }, with P given in (6.137)
and πJeff as defined in (6.156)
 −1
(B, Σ)|Y ∼ NIW(n, B, XT X , S)
 T −1
B|Y ∼ tp,d (n − d + 1, B, X X , S)

with

B = (XT X)−1 XT Y
S = (Y − XB)T (Y − XB).

Recall that B is the least-squares estimator of B given in (6.138).


Proof: Adding the normal likelihood for vec(Y) ∼ Nnd (XB, In ⊗ Σ) from
(6.135), the joint posterior of θ = (B, Σ) is
 
1  
π(θ|Y) ∝ |Σ|−(n+d+1)/2 exp − tr Σ−1 (Y − XB)T (Y − XB) . (6.157)
2
Applying the orthogonal decomposition (6.140) we obtain
 
−(n+d+1)/2 1  −1 
π(θ|Y) ∝ |Σ| exp − tr Σ Ξ
2
with
Ξ = S + (B − B)T XT X(B − B).
BAYES MULTIVARIATE LINEAR MODELS 179
This is the kernel of the NIW distribution above. The marginal posterior fol-
lows from Lemma 6.6.
2

Example 6.17 (Corn plants)


We continue with Example 6.16, and use both non-informative and conjugate
priors. First, using the least-squares estimation, we obtain:
⎛ ⎞ ⎛ ⎞
17.0 188.2 700 2012 1295 1496
⎜ ⎟ ⎜ ⎟
⎜ ⎟ ⎜ ⎟
⎜ 188.2 3602.8 8585 22231 ⎟ ⎜ 16204 17688 ⎟
XT X = ⎜ ⎜
⎟ , XT Y = ⎜
⎟ ⎜
⎟.

⎜ 700.0 8585.1 31712 84882 ⎟ ⎜ 54081 65302 ⎟
⎝ ⎠ ⎝ ⎠
2012.0 22231.4 84882 267090 153606 182220

Thus ⎛ ⎞
64.440 26.823
⎜ ⎟ ⎛ ⎞
⎜ ⎟
⎜ 1.301 0.090 ⎟ 2087.17 −349.79
B=⎜

⎟, S = ⎝

⎠.
⎜−0.130 1.189 ⎟ −349.79 3693.80
⎝ ⎠
0.023 0.095

We apply Theorem 6.12 with Jeffreys prior, π(Σ) ∝ |Σ|3/2 , and assuming
Y ∼ MN17,2 (XB, I17 , Σ), the posterior of B is the multivariate t distribution,
t4,2 (16, B, (XT X)−1 , S), the components of which are given above.
For the conjugate prior, we assume ν0 = 5, σ02 = 500, with
B|Σ ∼ MN4,2 (0, I4 , Σ) and Σ ∼ IW(ν0 , σ02 I2 ), i.e., we set the
prior NIW(ν0 , 0, I4 , σ02 I2 ). From Theorem 6.11, the joint posterior is
NIW(ν1 , B1 , C1 , Σ1 ), where ν1 = 22, and the moments are computed as
⎛ ⎞
48367.52 −49.42 −602.68 −168.71
⎜ ⎟
⎜ ⎟
⎜ −49.42 79.06 −24.73 1.65 ⎟
C1 = 10−5 ⎜ ⎜


⎜ −602.68 −24.73 36.75 −5.08 ⎟
⎝ ⎠
−168.71 1.65 −5.08 3.12
⎛ ⎞
33.27 13.86
⎜ ⎟ ⎛ ⎞
⎜ ⎟
⎜ 1.33 0.10 ⎟ 4732.92 543.10
B1 = ⎜

⎟ , Σ1 = ⎝

⎠.
⎜ 0.26 1.35 ⎟ 543.10 4567.11
⎝ ⎠
0.13 0.14
2
180 NORMAL LINEAR MODELS

6.6 List of Problems


1. Consider a multiple regression model

yi = β0 + xi β1 + zi β2 + εi , i = 1, . . . , n
n n n
with orthogonal design,
  i.e., i=1 xi = 0, i=1 zi = 0, i=1 xi zi = 0,
n 2 n 2
x
i=1 i = n and i=1 iz = n where ε i are i.i.d. normally distributed with
expectation zero and variance σ 2 = 1. The unknown three dimensional pa-
rameter β = (β0 , β1 , β2 )T is normally distributed with mean μ = (1, 1, 1)T
and covariance matrix ⎡ ⎤
1 0.5 0
⎢ ⎥
⎢ ⎥
Σ = ⎢0.5 1 0⎥ .
⎣ ⎦
0 0 1
(a) Write the model equation in matrix form.
(b) Determine the posterior distribution of β. Specify the expressions in
(6.23) and (6.24).
2. Consider the simple linear regression model

yi = α + βxi + εi , i = 1, . . . , n, εi ∼ N(0, σ 2 ) i.i.d. (6.158)

The unknown parameter is θ = (α, β, σ 2 ). We are mainly interested in the


posterior of the slope β.
(a) Determine Jeffreys prior πJeff (θ) under prior independence of regression
parameter and variance.
(b) Calculate the posterior distributions:
i. π((α, β)|y, σ 2 ).
ii. π(β|y, σ 2 )
iii. π(β|y, α, σ 2 ).
iv. π(σ 2 |y).
v. π(β|y).
3. Consider again the simple linear regression model (6.158) with unknown
parameter θ = (α, β, σ 2 ). We are mainly interested in the posterior of the
variance σ 2 . Let Jeffreys prior be πJeff (θ) ∝ (σ 2 )−1 .
(a) Calculate the posterior distribution π(σ 2 |y, β, α).
(b) Compare it with π(σ 2 |y) in Problem 2 (b).
4. Assume again (6.158) and transform the model by centering the data, such
that
yi − ȳ = β(xi − x̄) + ξi , i = 1, . . . , n. (6.159)
(a) Determine the relation between ξ1 , . . . , ξn and ε1 , . . . , εn .
(b) Determine the covariance matrix of ξ = (ξ1 , . . . , ξn ) and calculate its
determinant.
LIST OF PROBLEMS 181
T
(c) Set ξ(−n) = (ξ1 , . . . , ξn−1 ) , i.e., drop the last observation. Calculate the
determinant of the covariance matrix of ξ(−n) .
(d) Assume model (6.159) with i = 1, . . . , n − 1.
i. Consider Jeffreys prior. Calculate π(β|y, σ 2 ) and π(σ 2 |y). Compare
the results with those for model (6.158).
ii. Let σ 2 = 1 and β ∼ N1 (γb , λ2 ). Calculate π(β|y). Using independent
priors, β ∼ N1 (γb , λ2 ) and α ∼ N1 (γa , λ1 ), compare the results with
those for model (6.158).
5. Assume the univariate linear model (6.2) with ∼ Nn (0, σ 2 In ). We are
interested in the precision parameter σ −2 . Set θ = (β, σ −2 ). Determine the
conjugate family.
6. Assume the univariate linear model (6.2) with ∼ Nn (0, σ 2 In ). We are
interested in η = (β, σ), where σ is the scale parameter. Determine the Jef-
freys priors with and without independence assumption (π(η) = π(β)π(σ)).
7. Consider three independent samples:

X = (X1 , . . . , Xm ) i.i.d. sample from N(μ1 , σ 2 λ1 )


Y = (Y1 , . . . , Ym ) i.i.d. sample from N(μ2 , σ 2 λ2 ) (6.160)
2
Z = (Z1 , . . . , Zm ) i.i.d. sample from N(μ3 , σ λ3 )

where λ1 , λ2 , λ3 are known and μ1 + μ2 + μ3 = 0.


(a) Re-write the model (6.160) as univariate model. Determine the response
vector y, the design matrix X and the error covariance matrix σ 2 Σ.
(b) Suppose a conjugate prior for θ = (μ1 , μ2 , σ 2 ) gives the expressions for
the posterior. Specify the posterior covariance of (μ1 , μ2 ) given σ 2 un-
der the prior covariance σ 2 I2 and λ1 = λ2 = λ3 = 1. Set the prior
expectation of (μ1 , μ2 ) as zero.
(c) Consider only the first two samples X, Y . Compute the posterior, using
the same prior as in (b).
(d) Compare the posterior covariance matrix of (μ1 , μ2 ) given σ 2 based on
two samples with the corresponding posterior covariance matrix based
on three samples.
8. We are interested in two parallel regression lines:

yi = α + βxi + εi , i = 1, . . . , n, εi ∼ N(0, σ 2 ) i.i.d.


(6.161)
zi = γ + βxi + ξi , i = 1, . . . , n, ξi ∼ N(0, σ 2 ) i.i.d.,
n
where εi and ξi are mutually independent and i=1 xi = 0. The unknown
parameter is θ = (α, γ, β, σ 2 ). Assume a conjugate prior NIG(a, b, 0, λ−1 I3 ).
(a) Re-write the model (6.161) as univariate model. Determine the response
vector y, the design matrix X and the error covariance matrix σ 2 Σ.
(b) Calculate π(α, γ, β|y, σ 2 ) and π(β|y, σ 2 ).
182 NORMAL LINEAR MODELS
(c) Calculate π(σ |y).
2

(d) Give the marginal posterior of the slope π(β|y).


9. Consider the linear mixed model.

yi = βxi + γzi + εi , i = 1, . . . , n, εi ∼ N(0, 1), i.i.d., γ ∼ N(0, 1),


n n n
where i=1 x2i = n and i=1 zi2 = n but i=1 xi zi = 0. Assume a conju-
gate prior for β and give an explicit formula for β|y.
10. Assume two correlated regression lines

yi = αy + βy xi + εi , i = 1, . . . , n, εi ∼ N(0, σ12 ) i.i.d.


(6.162)
zi = αz + βz xi + ξi , i = 1, . . . , n, ξi ∼ N(0, σ22 ) i.i.d.,
n n 2
where Cov(y, z) = σ1,2 and i=1 xi = 0, i=1 xi = n. The unknown
2 2
parameter is θ = (αy , βy , αz , βz , σ1 , σ12 , σ2 ).
(a) Formulate the multivariate model.
(b) Assume a conjugate prior NIW(ν, B0 , C0 , Σ0 ), where C0 = diag(c1 , c2 )
and B0 = 0, Σ0 = I2 . Give the posterior expectation of σ1,2 .
Chapter 7

Estimation

We consider the Bayes model {P, π}, where P = {Pθ : θ ∈ Θ} and π is the
known prior distribution of θ over Θ. The data generating distribution Pθ is
the conditional distribution of X given θ and it is known up to θ.
Having observed x for X ∼ Pθ we want to determine the underlying parameter
θ – exploiting the model assumption {P, π}.
Applying the Bayesian inference principle means that the posterior distribu-
tion takes over the role of the likelihood; see Chapter 2. All information we
have about θ is included in the posterior. Depending on different estimation
strategies we choose as estimator the
• mode,
• expectation, or
• median
of the posterior distribution. Before we discuss each method in more detail,
we illustrate it by three examples. In the first example all estimators coincide.

Example 7.1 (Normal i.i.d. sample and normal prior)


In continuation of Example 2.12, we have an i.i.d. sample X = (X1 , . . . , Xn )
from N(μ, σ 2 ) with known variance σ 2 and the prior of θ = μ is N(μ0 , σ02 ).
Then the posterior given in (2.8) is N(μ1 , σ12 ), with
x̄nσ02 + μ0 σ 2
μ1 =
nσ02 + σ 2
and
σ02 σ 2
σ12 = .
nσ02 + σ 2
The posterior is symmetric in μ1 ; see also Figure 2.5. We have
x̄nσ02 + μ0 σ 2
Mode(θ|x) = E(θ|x) = Median(θ|x) = .
nσ02 + σ 2
2
In the second example all estimators are different.

DOI: 10.1201/9781003221623-7 183


184 ESTIMATION

4
prior = beta(2,3)
posterior = beta(3,10)
3

● mle
● mode
2

expectation
median
1

● ●
0

0.0 0.2 0.4 0.6 0.8 1.0

theta

Figure 7.1: Example 7.2. All estimates are situated in the center of the posterior
distribution: mle= 0.123, mode= 0.18, expectation= 0.23, median= 0.21.

Example 7.2 (Binomial distribution and beta prior)


In continuation of Example 2.11, we have X|θ ∼ Bin(n, θ) and θ ∼
Beta(α0 , β0 ), then

θ|x ∼ Beta(α1 , β1 ), with α1 = α0 + x, β1 = β0 + n − x.

For α1 = β1 the posterior density

B(α1 , β1 )−1 θα1 −1 (1 − θ)β1 −1

is not symmetric and mode, expectation and median differ. Particularly, we


have
α1 − 1 α1 α1 − 13
Mode(θ|x) = , E(θ|x) = , Median(θ|x) ≈ ,
α1 + β1 − 2 α1 + β1 α1 + β1 − 23
as illustrated in Figure 7.1, where for median and mode, α1 > 1, β1 > 1. 2

The Bayesian estimation approach has at least two advantages. One is the ele-
gant handling of the nuisance parameters by exploring the marginal posterior
of the parameter of interest, and the other is that the posterior distribution
delivers information about the precision of the estimator. We demonstrate
both properties in the following example.
MAXIMUM A POSTERIORI (MAP) ESTIMATOR 185
Example 7.3 (Normal i.i.d. sample and conjugate prior)
We have an i.i.d. sample X = (X1 , . . . , Xn ) from N(μ, σ 2 ) with unknown
expectation μ and unknown variance σ 2 , thus θ = (μ, σ 2 ). The parame-
ter of interest is μ; the variance is the nuisance parameter. In Chapter 6
the conjugate prior for the linear model is derived. This example is a spe-
cial case with p = 1, where all design points equal 1. The conjugate family
is the normal-inverse-gamma distribution, given in (6.41). We set the prior
NIG(α0 , β0 , μ0 , σ02 ). Recall, the prior of μ given σ 2 is N(μ0 , σ 2 σ02 ), the prior of
σ 2 is the inverse-gamma distribution InvGamma( α20 , β20 ). The prior parameters
α0 , β0 , μ0 and σ02 are known. We have
 
α +3
2 − 02 1 1
π(θ) ∝ (σ ) exp − 2 (β0 + 2 (μ − μ0 ) ) . 2
2σ σ0
We give the main steps of deriving the posterior, as follows,

π(θ|x) ∝ π(θ)(θ|x)
1 
n
∝ π(θ)(σ 2 )− 2 exp −
n
(xi − μ)2 .
2 σ 2 i=1

Let s2 be the sample variance. Using



n
(xi − μ)2 = (n − 1)s2 + n(μ − x̄)2
i=1

and completing the squares such that


1 1 1 1
n(μ − x̄)2 + (μ − μ0 )2 = 2 (μ − μ1 )2 + 2 μ20 + nx̄2 − 2 μ21 ,
σ02 σ1 σ0 σ1
with
σ1−2 = σ0−2 + n and μ1 = σ12 (σ0−2 μ0 + nx̄),
we obtain the conditional posterior of μ given σ 2 by N(μ1 , σ 2 σ12 ). The marginal
posterior of σ 2 is the inverse-gamma distribution with parameters ( α21 , β21 ),
where
α 1 = α0 + n
1 2 1 (7.1)
β1 = β0 + (n − 1)s2 + 2 μ0 + nx̄2 − 2 μ21 .
σ0 σ1
The parameter of interest is μ, therefore we are interested in the posterior
 ∞
π(μ|x) = π(μ|σ 2 , x)π(σ 2 |x) d σ 2 .
0

From Lemma 6.5, the marginal distribution of a normal-inverse-gamma dis-


tribution NIG(α1 , β1 , μ1 , σ12 ) is the scaled and shifted t-distribution with α1
186 ESTIMATION
-
degrees of freedom, location parameter μ1 and scale parameter αβ11 σ1 , whose
density is
  12  − a12+1
1 (μ − μ1 )2
π(μ|x) ∝ 1+ . (7.2)
β1 σ12 β1 σ12
The posterior is symmetric around μ1 , thus all estimates coincide with μ1 . As
measure of precision we can use the posterior variance
β1
Var(μ|x) = σ2 .
α1 − 2 1
For illustration, we simplify the example and set μ0 = 0 and σ0 = 1. Then
n
μ1 = x̄
n+1
1
σ12 =
n+1
α1 = α 0 + n
n
β1 = β0 + (n − 1)s2 + x̄2
(n + 1)
and  
1 β0 n−1 2 n 2
Var(μ|x) = + s + x̄ .
α0 + n − 2 n+1 n+1 (n + 1)2
For n → ∞ the leading term of the posterior variance is 1 2
α0 +n−2 s . 2
Now we are ready to discuss the above-mentioned three estimation methods
in more detail. We begin with the mode.

7.1 Maximum a Posteriori (MAP) Estimator


In Chapter 2 the maximum likelihood principle is introduced, that the best
explanation of data x is given by the maximum likelihood estimator
(MLE)
θMLE (x) ∈ arg max (θ|x). (7.3)
θ∈Θ
Combining the maximum likelihood principle and the principle of Bayesian
inference, where the posterior is exploited instead of the likelihood, leads to
following definition.

Definition 7.1 (Maximum a Posteriori Estimator) Assume the


Bayes model {P, π} with posterior π(θ|x). The maximum a posteriori
estimator (MAP) θMAP is defined as

θMAP (x) ∈ arg max π(θ|x). (7.4)


θ∈Θ
MAXIMUM A POSTERIORI (MAP) ESTIMATOR 187
Recall Example 3.16 on lion’s appetite. There the likelihood estimator and
the MAP estimator give the same result: the lion was hungry before he had
eaten three persons.

Note that, it is enough to know the kernel of π(θ|x) ∝ π(θ)(θ|x). The inte-
gration step is not required. Otherwise, for unknown distributions we have to
carry out a numerical maximization procedure. In case the posterior belongs
to a known distribution family, we can apply mode’s formula.

We continue now with two examples, the first of which uses mode’s formula.

Example 7.4 (Normal i.i.d. sample and gamma prior)


In continuation of Example 2.13, we have an i.i.d. sample X = (X1 , . . . , Xn )
from N(0, σ 2 ). The parameter of interest is the precision parameter θ = σ −2
with
n gamma prior Gamma(α0 , β0 ). The posterior is Gamma(α0 + n2 , β0 +
1 2
2 i=1 xi ). Mode’s formula of Gamma(α, β) delivers

α−1
Mode = , α ≥ 1, (7.5)
β
and we obtain, for n > 1,
2α0 + n − 2
θMAP (x) = n .
2β0 + i=1 x2i

For comparison, the MLE is


n
θMLE (x) = arg max (θ|x) = n .
θ∈Θ i=1 x2i
2

In the next example we apply a numerical solution. Recall Example 3.7 on


the weather experts.

Example 7.5 (Weather) Continuation of Example 3.15. Denote φ(m,σ2 ) (.)


as density of N(m, σ 2 ). The conjugate prior is a normal mixture comprising
of two different experts’ reports

π(θ) = ω1 φ(m1 ,τ12 ) (θ) + ω2 φ(m2 ,τ22 ) (θ);

see Figure 7.2. Then the posterior is

π(θ|x) = ω1 (x) φ(m1 (x),τ1,p


2 ) (θ) + ω2 (x) φ(m (x),τ 2 ) (θ)
2 2,p

with

mi (x) = ρi (xτi2 + mi σ 2 ), τi,p


2
= ρi στi2 , ρi = (τi2 + σ 2 )−1 , i = 1, 2,
188 ESTIMATION

Figure 7.2: Two weather experts.

and
τi 1 1
ωi (x) ∝ ωi exp − 2 m2i + 2 mi (x)2
τi,p 2τi 2τi,p

with ω1 (x) + ω2 (x) = 1. Figure 3.10 shows the prior, the posterior, and
the observation. By searching for the maximum of the y-values we obtain
θMAP (4) = 5.33. 2

The MAP estimators have a useful connection to regularized and restricted


estimators in the linear model.

7.1.1 Regularized Estimators


Recall the normal linear model (6.2) in Chapter 6. Here we assume a Bayes
linear model with general error distribution and general prior.

y = Xβ + . (7.6)

The parameter of interest is θ = β ∈ Rp . The unobserved error has expec-


tation zero. We assume
p( ) ∝ exp(−L(ε)). (7.7)
The function L(ε) can be considered as a type of loss function; it is nonnega-
tive, symmetric around zero and large for large errors. The prior has a similar
MAXIMUM A POSTERIORI (MAP) ESTIMATOR 189
structure and is given by

π(β) ∝ exp(−pen(β)). (7.8)

The function pen(β) can be considered as a type of penalty function, it is


nonnegative, symmetric around zero and large for large parameters. Under
this set up the likelihood is

(β|y) ∝ exp(−L(y − Xβ))

and the posterior

π(β|y) ∝ exp (−L(y − Xβ) − pen(β)) .

We obtain for
βMAP (y) ∈ arg maxp π(β|y) (7.9)
β∈R

that it is equivalent to

βMAP (y) ∈ arg minp (L(y − Xβ) + pen(β))) . (7.10)


β∈R

It means, choosing the right Bayes model, the Bayes MAP estimator coincides
with a regularized estimator in regression. It is a bridge between Bayes and
frequentist approach. There is one more bridge. The optimization problem in
(7.10) has an equivalent formulation that there exists a constant k > 0 such
that
βMAP (y) ∈ arg min L(y − Xβ). (7.11)
{β:pen(β)≤k}

This means, choosing the right Bayes model, the Bayes MAP estimator co-
incides with a restricted estimator in regression. For short overview of
regularized and restricted estimators, we recommend Zwanzig and Mahjani
(2020, Section 6.4) and Hastie et al. (2015).

Now we consider popular cases for L and pen.

Ridge
We start with normal linear model with known covariance matrix Σ and con-
jugate prior:
 
1 T −1
(β|y) ∝ exp − (y − Xβ) Σ (y − Xβ)
2
  (7.12)
1 T −1
π(β) ∝ exp − (β − γ) Γ (β − γ)
2
190 ESTIMATION
The posterior is N(μβ|y , Σβ|y ) given in Corollary 6.1 as
 
1
π(β|y) ∝ exp − (β − μβ|y )T Σ−1 β|y (β − μ β|y )
2
(7.13)
μβ|y = Σβ|y (XT Σ−1 y + Γ−1 γ)
Σβ|y = (XT Σ−1 X + Γ−1 )−1 .
The posterior is symmetric around μβ|y , such that mode and expectation equal
μβ|y . Applying (6.18) for μβ|y , we obtain

βMAP (y) = γ + ΓXT (XΓXT + Σ)−1 (y − Xγ).


This estimator is also known as generalized ridge estimator. Setting Σ = I,
γ = 0 and Γ = λ−1 I, we obtain the classical ridge estimator
βridge = (XT X + λI)−1 XT y, (7.14)
introduced by Hoerl and Kennard (1970). The idea that Hoerl and Kennard
(1970) pursued was to augment XT X by λI, since XT X + λI is always invert-
ible even when XT X is not. The main properties of the ridge estimator are
presented in Zwanzig and Mahjani (2020, Section 6.4.1).

Lasso
Suppose a normal linear model with covariance matrix Σ = σ 2 I. The com-
ponents of β are independent and Laplace distributed La(m, b) with location
2
parameter m = 0 and scale parameter b = 2σλ , so that
 
1
(β|y) ∝ exp − 2 (y − Xβ) (y − Xβ)
T

⎛ ⎞
p (7.15)
λ
π(β) ∝ exp ⎝− 2 |βj |⎠ .
2σ j=1

The posterior is
⎛ ⎛ ⎞⎞
1 p
π(β|y) ∝ exp ⎝− 2 ⎝(y − Xβ)T (y − Xβ) + λ |βj |⎠⎠ . (7.16)
2σ j=1

The βMAP is equivalent to the lasso estimator, defined by Tibshirani (1996)



p
βlasso (y) ∈ arg minp (y − Xβ)T (y − Xβ) + λ |βj |, (7.17)
β∈R
j=1

see also Zwanzig and Mahjani (2020, Section 6.4.2).

Zou and Hastie (2005) propose the following compromise between Lasso and
ridge estimators.
MAXIMUM A POSTERIORI (MAP) ESTIMATOR 191
Elastic Net
Consider a normal linear model with covariance matrix Σ = σ 2 I. The compo-
nents of β are i.i.d. from a prior distribution given by a compromise between
σ2 σ2
N(0, λ(1−α) ) and Laplace La(0, λα 2 ) prior. For α = 1, it is the Laplace prior
2 2
La(0, σλ ) and for α = 0, it is the normal prior N(0, σλ ); all other values
α ∈ (0, 1) lead to a compromise. It gives
 
1
(β|y) ∝ exp − 2 (y − Xβ)T (y − Xβ)

⎛ ⎛ ⎞⎞
 p p (7.18)
⎝ λ ⎝ ⎠ ⎠
π(β) ∝ exp − 2 (1 − α) 2
βj + α |βj | .
2σ j=1 j=1

The βMAP is equivalent to the elastic net estimator, defined by


⎛ ⎞

p 
p
βnet (y) ∈ arg minp (y − Xβ)T (y − Xβ) + λ ⎝(1 − α) βj2 + α |βj |⎠
β∈R
j=1 j=1
(7.19)
in Zou and Hastie (2005). The penalty term is a compromise between L2 -type
ridge penalty and L1 -type lasso penalty.

Note that, the regularized estimators depend strongly on the weight λ assigned
to the penalty term. There exists an extensive literature on adaptive methods
for choosing the tuning parameter λ and respectively α, but this is beyond
the scope of this book. In the Bayesian context the tuning parameters are
hyperparameters of the prior. We refer to Chapter 3 where different proposals
for the prior choice are presented.

We conclude the section with an illustrative example.

Example 7.6 (Regularized estimators)


We consider a polynomial regression model

yi = β0 + β1 xi + β3 x2i + β4 x4i + εi , i = 1, . . . , n.

A small data set is generated with 9 equidistant design points between 0 and
4; εi ∼ N(0, 0.52 ) i.i.d. and the true regression function

f (x) = 3 + x + 0.5x2 − x3 + 0.2x4 .

Using R the lasso estimate is calculated for λ = 0.012; the ridge estima-
tor for λ = 0.05; and the elastic net estimator for α = 0.5 and λ = 0.02.
192 ESTIMATION
The R-packages also include methods for an adaptive choice of the tuning
parameters. Here for illustrative purposes the tuning parameters are chosen
arbitrarily. Figure 7.3 shows different fitted polynomials. 2

R Code 7.1.7. Regularized estimators in Figure 7.3.

a0<-3; a1<-1; a2<-0.5; a3<--1; a4<-0.2 # true parameter


xx<-seq(0,4,0.01)
xx2<-xx*xx
xx3<-xx*xx*xx
xx4<-xx*xx*xx*xx
ff<-a0+a1*xx+a2*xx2+a3*xx3+a4*xx4
plot(xx,ff,"l",ylim=c(-2,5),xlab="",ylab="",lwd=2) # true
x<-seq(0,4,0.5) # design points
x2<-x*x
x3<-x*x*x
x4<-x*x*x*x
f<-a0+a1*x+a2*x2+a3*x3+a4*x4
y<-f+rnorm(9,0,0.5) # generated observations
points(x,y,col=1,lwd=3)
## lse
A<-coef(lm(y~x+x2+x3+x4))
flse<-A[1]+A[2]*xx+A[3]*xx2+A[4]*xx3+A[5]*xx4
lines(xx,flse,lwd=2,lty=1,col=gray(0.4)) # lse fit
## ridge
library(MASS)
Ridge<-lm.ridge(y~x+x2+x3+x4, lambda=0.05)
R<-coef(Ridge)
fridge<-R[1]+R[2]*xx+R[3]*xx2+R[4]*xx3+R[5]*xx4
lines(xx,fridge,col=gray(0.4),lwd=2,lty=2) # ridge fit
## lasso
library(lars)
X<-matrix(c(x,x2,x3,x4),ncol=4)
Lasso<-lars(X,y,type="lasso")
L<-Lasso$beta[9,]
flasso<-L[1]+L[2]*xx+L[3]*xx2+L[4]*xx3+L[5]*xx4
lines(xx,flasso,col=gray(0.4),lwd=2,lty=3) # lasso fit
## elastic net
library(glmnet)
MLSE<-glmnet(X,y,alpha=0.5,df=5)
N<-coef(MLSE,s=0.02)
fnet<-N[1]+N[2]*xx+N[3]*xx2+N[4]*xx3+N[5]*xx4
lines(xx,fnet,col=gray(0.4),lwd=2,lty=4) # elastic net fit
BAYES RULES 193

5
4
● ●
3

● ● ●
2


1

true ●
lse
0

lasso ●
ridge ●
−1

net
−2

0 1 2 3 4

Figure 7.3: Example 7.6. Since the sample size n = 9 is small with respect to the
number of parameters p = 5 the least-squares method overfits the data. Different
regularized estimators correct the overfit.

7.2 Bayes Rules


In this section we discuss the methods based on expectation and median of
the posterior. Both are Bayes rules given in Definition 4.2. Readers, interested
in their optimality properties, are referred to Chapter 4. Here we present the
application. Set the following notation

θL2 (x) = E(θ|x)


(7.20)
θL1 (x) = Median(θ|x).

The subindex is related to the loss functions. The estimator θL2 obeys opti-
mality properties with respect to L2 loss and θL1 with respect to L1 loss. We
have already presented several examples related to these estimates. Besides
the introductory Examples 7.1, 7.2 and 7.3, check Examples 4.6, 4.7 and 4.8
for θL2 and Example 4.9 for θL1 .
In the following table priors and corresponding posterior expectations are
collected related to some popular one parameter exponential families with pa-
rameter θ, assuming all other parameters known. The table is partly taken
from Robert (2001), where posterior distributions are given for single x.
194 ESTIMATION

Distribution Prior Posterior


p(x|θ) π(θ) E(θ|x)
Normal Normal
2 σ 2 μ+τ 2 x
N(θ, σ ) N(μ, τ 2 ) σ 2 +τ 2

Poisson Gamma
α+x
Poi(θ) Gamma(α, β) β+1

Gamma Gamma
α+ν
Gamma(ν, θ) Gamma(α, β) β+x

Binomial Beta
α+x
Bin(n, θ) Beta(α, β) β+n−x

Negative Binomial Beta


α+m
NB(m, θ) Beta(α, β) α+β+x+m

Multinomial Dirichlet
Multk (θ1 , . . . , θk ) Dir(α1 , . . . , αk ) αi +xi
j αj +n

Normal Gamma
 
N μ, θ1 Gamma(α, β) 2α+1
2β+(x−μ)2

Normal InvGamma
2β+(μ−x)2
N (μ, θ) InvGamma(α, β) 2α−1

Recall that for symmetric single mode posteriors both estimators in (7.20)
coincide. In order to illustrate a posterior with two local maxima we come
back to the classroom Example 3.7 now with two more controversial and more
equally weighted weather experts.

Example 7.7 (Weather)


Continuation of Examples 3.7 and 3.15. Assume that both experts have very
different subjective priors and they are closely weighted than in (3.7); we set
the mixture prior

π(θ) = 0.4 φ(−8,4) (θ) + 0.6 φ(8,10) (θ), (7.21)

where φ(μ,σ2 ) denotes the density of N(μ, σ 2 ). Applying (3.24) we obtain for
x=1
π(θ|x) = 0.34 φ(−2,3.33) (θ) + 0.66 φ(4.89,2.22) (θ).
BAYES RULES 195

prior

0.15
posterior

0.10

● expectation
median
0.05
0.00

−15 −10 −5 0 5 10 15

theta

Figure 7.4: Example 7.7.

The posterior is more concentrated, but it is still a clear mixture of two com-
ponents, as seen in Figure 7.4. The estimate

θL2 = 0.34 (−2.3) + 0.66 (4.89) ≈ 2.55

is the weighted average of the expectation of both components. The estimate

θL1 ≈ 3.81

is calculated by a simple searching algorithm. The neighbourhood around the


posterior expectation has low posterior probability. Both estimates are illus-
trated in Figure 7.4. 2

We continue with more applications.

7.2.1 Estimation in Univariate Linear Models


We assume the linear model P given in (6.38):

y = Xβ + , ∼ Nn (0, σ 2 Σ), θ = (β, σ 2 ),

under both Bayes models, {P, πc }, with conjugate prior, and {P, πJeff }, with
non-informative prior. The joint posterior distribution is a normal-inverse-
gamma distribution,
θ|y ∼ NIG(a, b, γ, Γ),
196 ESTIMATION
see Theorems 6.5 and 6.7, respectively. Under the linear mixed model (6.87) we
also obtain a posterior belonging to the normal-inverse-gamma distributions,
see Theorems 6.8, 6.9 and 6.10.
This implies the marginal distributions
b
β|y ∼ tp (a, γ, Γ) and σ 2 ∼ InvGamma(a/2, b/2).
a
In case of known σ 2 we use
β|(y, σ 2 ) ∼ Np (γ, Γ).
The t distribution and the normal distribution are symmetric around the
location parameter. We have
βL2 (y) = βL1 (y) = γ.
The inverse-gamma distribution is not symmetric. We apply the formula for
expectation and obtain
b
σ 2 = σL2 2 (y) = .
a−2
Unfortunately, the median is not given in a closed form. In case, the L1 -
estimate of the variance is of main interest, numerical methods are needed.
There is no difference in the estimate of β wether σ 2 is known or unknown.
But its precision is different. It holds that
Cov(β|y, σ 2 ) = σ 2 Γ
and
b
Cov(β|y) = Γ = σ 2 Γ.
a−2

Example 7.8 (Simple linear regression)


We assume
yi = βxi + εi , i = 1, . . . , n, εi ∼ N(0, σ 2 ), i.i.d.
with θ = (β, σ 2 ). The prior NIG(a, b, γ, Γ) gives the posterior NIG(a1 , b1 , γ1 , Γ1 )
given in Example 6.8 on page 151. The proposed Bayes estimates are
b1
βL2 = γ1 and σ 2 =
a+n−2
with
−1

n 
n
−1
γ1 = x2i +λ xi yi + λ−1 γ
i=1 i=1
−1 2

n 
n 
n
−1 2 −1 −1
b1 = b + yi2 +λ γ − x2i +λ x i yi + λ γ .
i=1 i=1 i=1
BAYES RULES 197
−2
In case of Jeffreys prior πJeff (θ) ∝ σ we obtain
n
1 
n
i=1 xi yi
β L2 = n
2
2 = β and σL2 = n − 3 (yi − xi β)2 .
i=1 xi i=1

We illustrates this simple model with the life-length data.

Example 7.9 (Life length vs. life line)


We continue with Example 6.4 based on data for Example 6.3. We study
the centred model (6.34). Assuming σ 2 = 160 and β ∼ N(0, 100), we obtain
βL2 = −2.257. In Example 6.10 the variance is supposed to be unknown and
a conjugate prior is assumed. Then the Bayes estimate is βL2 = −2.257. The
estimates are the same and give a negative trend, just opposite to the popular
belief. The equality of estimates is not surprising because the expected value
of the prior of the variance coincides with the supposed known value in the
first case. We come back to this data and model later to assess the significance
of this trend. 2

Now let us compare the estimates under a simple linear model with intercept.

Example 7.10 (Life length vs. life line)


We continue with Example 6.4 based on data for Example 6.3 but now we
assume model (6.33) with σ 2 = 160. Assuming the independent priors

α ∼ N(0, 100 σ 2 ), β ∼ N(0, 100 σ 2 ),

we obtain
αL2 = 86.69, βL2 = −2.13.
Alternatively, assuming

α ∼ N(100, 100 σ 2 ), β ∼ N(0, 100 σ 2 ),

again independent, we get

αL2 = 88.04, βL2 = −2.27.

The priors differ only in the expected value of the intercept. Also in this study
we get a negative trend against the popular belief. We will revisit these results
later. 2

We consider an example of Bayes estimation in the linear mixed model.


198 ESTIMATION
Example 7.11 (Hip operation)
In Example 6.15 the posterior distributions are derived. The estimate for the
slope β1 is β1 = 0.5 with Var(β1 |y) = 0.0054. It implies that the covariate age
has influence on the healing process. 2

7.2.2 Estimation in Multivariate Linear Models


We consider the model P given in (6.131), i.e.,

Y = XB + E, E ∼ MNn,d (O, In , Σ)

where Y is the n × d matrix of responses, E is the n × d matrix of random


errors, B is the p × d matrix of unknown parameters and X is the n × p
design matrix, assumed to be of full rank. The unknown parameter consists
of B and Σ. Under both Bayes models {P, πc }, with conjugate prior, and
{P, πJeff }, with non-informative prior, the posterior distributions belong to
the normal-inverse-Wishart distributions given in Theorems 6.11 and 6.12:

(B, Σ)|Y ∼ NIW(ν1 , B1 , C1 , Σ1 ) and B|Y ∼ tp,d (ν1 − d + 1, B1 , C1 , Σ1 ).

This delivers the estimates


1
BL2 = B1 and Σ = Σ1 .
ν1 − d − 1
We illustrate the estimation of the covariance matrix Σ in the following ex-
ample.

Example 7.12 (Corn plants)


Continuing with Example 6.16 on page 172, in model {P, πJeff } we obtain
⎛ ⎞
1 149.09 −24.99
Σ= S=⎝ ⎠.
14 −24.99 264.29

For model {P, πc } we get


⎛ ⎞
1 249.10 28.58
Σ= Σ1 = ⎝ ⎠.
19 28.58 240.37

2
CREDIBLE SETS 199
7.3 Credible Sets
In this section we deal with Bayes confidence regions. We assume the Bayes
model {P, π}, where P = {Pθ : θ ∈ Θ} and π is the known prior distribution
of θ over Θ. Having observed x for X ∼ Pθ we want to determine a region
Cx ∈ Θ such that the underlying parameter θ ∈ Cx . According to the Bayesian
inference principle that all information is included in the posterior π(θ|x), we
define the set Cx as following, where HPD stands for highest posterior
density.

Definition 7.2 (Credible Region) Assume the Bayes model {P, π}


with posterior distribution Pπ (.|x). A set Cx is a α-credible region iff

Pπ (Cx |x) ≥ 1 − α, α ∈ [0, 1]. (7.22)

This region is called HPD α-credible region if it can be written as

{θ : π(θ|x) > kα } ⊆ Cx ⊆ {θ : π(θ|x) ≥ kα } (7.23)

where kα is the largest bound such that

Pπ (Cx |x) ≥ 1 − α.

In case the posterior is a continues distribution with density π(θ|x) the defi-
nition simplifies to
Cx = {θ : π(θ|x) ≥ kα } with Pπ (π(θ|x) ≥ kα |x) = 1 − α. (7.24)
We illustrate it by the following examples.

Example 7.13 (Normal i.i.d. sample and normal prior)


Consider an i.i.d. sample X = (X1 , . . . , Xn ) from N(μ, σ 2 ) with known vari-
ance σ 2 , thus θ = μ. We assume the prior θ ∼ N(μ0 , σ02 ). In Example 2.12 on
page 16, the posterior is derived as
x̄nσ02 + μ0 σ 2 σ02 σ 2
N(μ1 , σ12 ), with μ1 = 2 , and σ12 = .
nσ0 + σ 2 nσ02 + σ 2
The posterior has a continuous density and is unimodal and symmetric around
the Bayes estimate θ = μ1 . We calculate HPD α-credible interval by
 
1 (θ − μ1 )2
{θ : π(θ|x) ≥ kα } = {θ : √ exp − ≥ kα }
2πσ1 σ12
(7.25)
(θ − μ1 )2 √
= {θ : ≤ ln(kα 2πσ1 )}
σ12
200 ESTIMATION

0.30
0.30

0.25
0.25

0.20
0.20

posterior
posterior

0.15
0.15

0.10
0.10

0.05
0.05

0.00
0.00

[ ] [ ]

0 2 4 6 8 10 0 2 4 6 8 10

theta theta

Figure 7.5: Example 7.25. Left: HPD interval for α = 0.5 with interval length 5.88.
Right: Credible interval for α = 0.5, which is not HPD. The interval length is 7.12.

The bound kα is calculated by


 α
ln(kα (2π)σ1 ) = (z1− α2 )2 , where P(Z < z1− α2 ) = 1 − , Z ∼ N(0, 1).
2
The HPD α-credible interval is given as

Cx = {θ : μ1 − z1− α2 σ1 ≤ θ ≤ μ1 + z1− α2 σ1 },

which has the same structure as the frequentist confidence interval, but now
around the Bayes estimate. 2

Let us consider a case where the posterior density is skewed.

Example 7.14 (Normal i.i.d. sample and inverse-gamma prior)


Recall Example 2.14 on page 19. Consider an i.i.d. sample X = (X1 , . . . , Xn )
from N(0, σ 2 ) with unknown variance σ 2 . We set as prior an inverse-gamma
distribution InvGamma(α, β).Then the posterior is InvGamma(α1 , β1 ) with
n
α1 = α + n/2 and β1 = β + i=1 x2i /2 which has the density
 α1 +1  
β α1 1 β1
f (θ|α1 , β1 ) = 1 exp − . (7.26)
Γ(α1 ) θ θ

The density is unimodal but skewed. The HPD α-credible interval is given by
 α1 +1  
1 β1
{θ : π(θ|x) ≥ kα } = {θ : exp − ≥ cα } (7.27)
θ θ
CREDIBLE SETS 201
0.8

0.8
0.6

0.6
posterior

posterior
0.4

0.4
0.2

0.2
0.0

0.0
[ ] [ ]

0 2 4 6 8 0 2 4 6 8

variance variance

Figure 7.6: Example 7.14. Left: HPD interval for α = 0.1 with interval length 2.46.
The right tail probability is around 0.097. Right: Credible interval for α = 0.1, which
is equal–tailed. The interval length is 3.19.

where cα is a constant calculated from kα . The explicit interval requires a


numerical solution; see R code and Figure 7.6. This interval is not symmetric
around a Bayes estimator. It is not equal-tailed like the usual approximation
in the frequentist approach; see Liero and Zwanzig (2011, p.171). 2

The following R code illustrates the construction of an HPD α-credible interval


for a unimodal skewed density; see also Figure 7.7.
R Code 7.3.8. Example 7.14

library(invgamma)
a<-3; b<-3 # posterior parameter
aa<-seq(0,0.1,0.000001) # confidence levels on the lower side
q1<-qinvgamma(aa,a,b) # quantile lower bound
q2<-qinvgamma(aa+0.9,a,b) # quantile upper bound
plot(aa,dinvgamma(q1,a,b),"l") # see Figure
lines(aa,dinvgamma(q2,a,b))
# Simple search algorithm gives the crossing point at 0.078
k<-0.078; lines(aa,rep(k,length(aa)))
# Simple searching algorithm gives approximative bounds
L<-0.30155 ; U<- 2.765,
# such that k=dinvgamma(L,a,b) and k=dinvgamma(U,a,b).
Unfortunately in case of multi-modal densities, the interpretation of HPD
credible intervals is not so convincing. We demonstrate it with the help of
the classroom example (Example 3.7, page 37) about the two contradicting
weather experts.
202 ESTIMATION

0.6
0.5
0.4
0.3
0.2
0.1
0.0

0.00 0.02 0.04 0.06 0.08 0.10

Figure 7.7: Illustration for R code related to Example 7.14.

Example 7.15 (Weather)

Recall Example 3.15 on page 48. Continuing with Example 7.7, the poste-
rior is a normal mixture distribution shown in Figure 7.4. Depending on the
confidence level α, it may happen that the α-credible region consists of two
separate intervals as demonstrated in Figure 7.8. 2
0.20

0.20
0.15

0.15
posterior

posterior
0.10

0.10
0.05

0.05
0.00

0.00

[ ] [ ] [ ]

−5 0 5 10 −5 0 5 10

theta theta

Figure 7.8: Example 7.15. Left: HPD interval for α = 0.05. Right: HPD interval for
α = 0.1.
CREDIBLE SETS 203
The interpretation of HPD credible regions can be complicated for discrete
posterior distributions, i.e., for the parameter space Θ ⊆ Z. In this case the
HPD α-credible regions do not have to be unique. We discuss it with the help
of the lion example on page 7.

Example 7.16 (Lion’s appetite)


Consider the model in Example 2.4 and the prior in Example 3.16, with
π(θ1 ) = π(θ2 ) = 0.1. For x = 1 the likelihood is constant, such that prior and
posterior coincide, i.e.,

θ1 θ2 θ3
.
π(θ|x = 1) 0.1 0.1 0.8

Set α = 0.1. We have two HPD α-credible regions: C1 = {θ1 , θ3 } and


C2 = {θ2 , θ3 }. The first region C1 includes lion’s hungry and lethargic modes;
second region contains moderate and lethargic modes. 2

7.3.1 Credible Sets in Linear Models


We assume the linear model

y = Xβ + , ∼ Nn (0, σ 2 In ), θ = (β, σ 2 ).

Further we have that the posterior is θ|y ∼ NIG(a1 , b1 , γ1 , Γ1 ), i.e.,

β|y, σ 2 ∼ Np (γ1 , σ 2 Γ1 )
b1
β|y ∼ t(a1 , γ1 , Γ1 ) (7.28)
a1
σ 2 |y ∼ InvGamma(a1 /2, b1 /2).

The construction of credible regions depends only on the posterior distribu-


tion. Therefore we can apply the calculations in Example 7.14 for credible
intervals of σ 2 . The posterior distribution of β in case of known σ 2 is normal;
in case of unknown σ 2 it is a multivariate t-distribution. Both distributions
are elliptical, i.e., the level sets are ellipsoids. Thus the HPD α-credible regions
are ellipsoids, centered at the Bayes estimator γ1 . First we consider the case
of known σ 2 . It holds that there exists a constant cα such that
1
{β : π(β|y, σ 2 ) ≥ kα } = {β : (β − γ1 )T Γ−1
1 (β − γ1 ) ≤ cα }.
σ2
We have that the quadratic form
1
Q= (β − γ1 )T Γ−1
1 (β − γ1 ) ∼ χp ,
2
σ2
204 ESTIMATION
where χ2p 2
denotes the χ –distribution with p degrees of freedom. Thus the
HPD α-credible region is the ellipsoid
1
Cy = {β : (β − γ1 )T Γ−1
1 (β − γ1 ) ≤ χp (1 − α)}
2
σ2
where χ2p (1 − α) is the (1 − α)–quantile of χ2 –distribution with p degrees of
freedom. In case of unknown σ 2 we have
{β : π(β|y) ≥ kα } = {β : f (β) ≥ kα } (7.29)
where f is the density of t(a1 , γ1 , ab11 Γ1 ) given in (6.51). Thus there exists a
constant cα such that
1
Cy = {β : (β − γ1 )T Γ−1
1 (β − γ1 ) ≤ cα }. (7.30)
b1
We apply the following relation between t-distribution and F-distribution; see
Lin (1972). If X ∼ tq (ν, μ, Σ) then
1
(X − μ)T Σ−1 (X − μ) ∼ Fq,ν , (7.31)
q
where Fq,ν denotes the F-distribution with q and ν degrees of freedom. Let
Fp,a1 (1 − α) be the 1 − α quantile of F-distribution with p and a1 degrees of
freedom. The HPD α-credible set is given by
a1
Cy = {β : (β − γ1 )T Γ−1
1 (β − γ1 ) ≤ Fp,a1 (1 − α)}. (7.32)
p b1
Recall that the Bayes estimate of σ 2 is b1 /(a1 − 2).

Example 7.17 (Life length vs. life line)


We analyze the data for Example 6.3 on page 130 under the simple linear re-
gression model with unknown variance; θ = (α, β, σ 2 ). In Example 6.4 on page
139, the posterior distribution is derived for σ 2 = 160. Here we additionally
set a prior on σ 2 , so that
⎛ ⎞ ⎛⎛ ⎞ ⎛ ⎞⎞
α 100 100 0
⎝ ⎠ ∼ N 2 ⎝⎝ ⎠ , σ 2 ⎝ ⎠⎠ and σ 2 ∼ InvGamma(5, 640).
β 0 0 100

with E(σ 2 ) = 160. Applying Theorem 6.5 we obtain


θ|y ∼ NIG(a1 , b1 , γ1 , Γ1 )
with a1 = a + n = 58, b1 = 7554 and
⎛ ⎞ ⎛ ⎞
1.346 −0.146 88.04
Γ1 = ⎝ ⎠ , γ1 = ⎝ ⎠.
−0.146 0.016 −2.27
CREDIBLE SETS 205

2
0
slope

−2
−4
−6

60 80 100 120

intercept

Figure 7.9: Example 7.17. The larger ellipse is the HPD credible region for α = 0.05;
the smaller is for α = 0.1.

The credible region for (α, β) is an ellipse given in (7.32) where the F–quantiles
are F2,a1 (0.9) = 2.39 and F2,a1 (0.95) = 3.15, illustrated in Figure 7.9. 2

R Code 7.3.9. Credible Region, Figure 7.9.

qf1<-qf(0.95,2,a1) # quantile F distribution


qf2<-qf(0.9,2,a1)
b<-gamma1 # Bayes estimate
S<-Gamma1
Q<-function(x,y){a1/(2*b1)*t(c(x,y)-b)%*%solve(S)%*%(c(x,y)-b)}
#Quadratic form
x1<-seq(45,130,1) # intercept
x2<-seq(-7,3,0.1) # slope
Z<-as.matrix(outer(xx1,xx2))
for (i in 1:length(xx1))
{ for (j in 1: length(xx2))
{ Z[i,j]<-Q(x1[i],x2[j])
}}
contour(x1,x2,Z,xlab="intercept",ylab="slope",level=c(qf1,qf2),
drawlabels=FALSE)
206 ESTIMATION
7.4 Prediction
We consider the problem to predict a future observation xf ∈ Xf . We refer
to Section 4.3.3 for theoretical background. The Bayes model {P, π} with the
posterior π(θ|x) is assumed. The future data point xf is generated by a distri-
bution Qθ , possibly depending on the data x. Setting z = xf , we have q(z|θ, x).
The main tool is the predictive distribution. It takes over the role of the poste-
rior. The predictive distribution is defined as the conditional distribution
of the future data point given the data x;

π(z|x) = q(z|θ, x)π(θ|x) dθ. (7.33)
Θ

Analogous to the estimation problem, depending on different strategies, the


Bayes predictor can be mode, expectation or median of the predictive distribu-
tion. In Section 4.3.3 it is shown that the expectation is the Bayes predictor
which is optimal with respect to a quadratic loss,

xf (x) = z π(z|x) dz. (7.34)
Xf

The prediction regions Xpred ⊂ Xf of level 1 − α are defined by



P(Xpred |x) = π(z|x) dz = 1 − α. (7.35)
Xpred

The best choice with minimal volume are the sets with highest predictive
probability;

π(z|x) dz = 1 − α and π(z|x) ≥ π(z  |x) for z ∈ Xpred , z  ∈
/ Xpred .
Xpred
(7.36)

The main problem is now to derive the predictive distribution. In this sec-
tion we present several cases, where it is possible to derive explicit formulas.
Otherwise computer-intensive methods, explained in Chapter 9, can be used.
In the following example we have an i.i.d. sample and want to predict the next
observation.

Example 7.18 (Poisson distribution and gamma prior)


Consider an i.i.d. sample X = (X1 , . . . , Xn ) from Poi(λ) such that
n  xi
 
λ
p(x|λ) = exp(−λ) , (7.37)
i=1
xi !

and for θ = λ n
(θ|x) ∝ θ i=1 xi
exp(−nθ). (7.38)
PREDICTION 207
Assume the conjugate prior Gamma(α, β)

β α α−1
π(θ) = θ exp(−βθ), (7.39)
Γ(α)

where α and β are known; α/β stands for a guess of θ and β for the sample
size this guess is based on. We obtain

π(θ|x) ∝ π(θ)(θ|x)
n
∝ θα−1 exp(−βθ) θ i=1 xi
exp(−nθ) (7.40)
n
xi −1
∝ θα+ i=1 exp(−(β + n)θ),
n
hence the posterior is Gamma(α + i=1 xi , β + n). Summarizing for an i.i.d.
Poisson sample we have

n
θ ∼ Gamma(α, β) and θ|x ∼ Gamma(α + xi , β + n); (7.41)
i=1

see Figure 7.10. The Bayes estimator of θ is the expectation of the posterior,
n
α + i=1 xi
θ L2 = .
β+n
n
For the predictive distribution, set sn = i=1 xi . We have Xf ∼ Poi(θ),
independent of x, and

P(Xf = k|x) = q(k|θ, x)π(θ|x) dθ
Θ
 ∞ k
θ (β + n)α+sn α+sn −1
= exp(−θ) θ exp(−(β + n)θ) dθ
0 k! Γ(α + sn )

(β + n)α+sn ∞ α+sn +k−1
= θ exp(−(β + n + 1)θ) dθ.
Γ(α + sn )k! 0
(7.42)

Applying the integral


 ∞
xa−1 exp(−b x) dx = Γ(a)b−a (7.43)
0

for a = α + sn + k and b = β + n + 1, we obtain the predictive distribution


 α+sn  k
Γ(α + sn + k) β+n 1
P(Xf = k|x) = . (7.44)
Γ(α + sn )k! β+n+1 β+n+1

Set r = α + sn and 1 − p = (β + n + 1)−1 . The predictive distribution is the


208 ESTIMATION
0.8

0.20
likelihood
prior
posterior
0.6

0.15
0.4

0.10
0.05
0.2

0.00
0.0

● ● ● ● ● ● ● ●

0 2 4 6 8 0 2 4 6 8 10 12

lambda k

Figure 7.10: Example 7.18. Left: Likelihood function of an i.i.d Poisson sample with
n = 5, prior Gamma(3, 2) and the posterior Gamma(28, 7). Right: The predictive
distribution NB(28, 0.875). The mode is 3. The predictive region {1, 2, 3, 4, 5, 6, 7, 8}
has the level 0.946.

generalized negative binomial distribution NB(r, p), where r is positive and


real, and the expected value is (1−p)r
p . We obtain the Bayes predictor

1 n
xf (x) = (α + xi ), (7.45)
β+n i=1

so that Bayes predictor and Bayes estimator coincide. This is no surprise, since
the parameter θ = λ is the expectation of Poi(λ). In general, the expectation
is not an integer. We recommend instead the mode of the predictive distribu-
tion. The mode of the generalized negative binomial distribution NB(r, p) is
the largest integer equal or less than p(r−1)
1−p , r > 1. It is illustrated in Figure
7.10. 2

The next example is related to a dynamic model where the future observation
depends on the past.

Example 7.19 (Autoregressive model AR(1))


Set x0 as given. We observe xT = (x1 , . . . , xT ) from XT = (X1 , . . . , XT )
generated by the first-order autoregressive model, given as

Xt = ρxt−1 + εt , εt ∼ N(0, σ 2 ) i.i.d., t = 1, . . . , T. (7.46)

The unknown parameter is θ = (ρ, σ 2 ). It holds that

1 − ρ2t 2
Var(Xt |θ) = ρ2 Var(Xt−1 |θ) + σ 2 = σ , |ρ| < 1. (7.47)
1 − ρ2
PREDICTION 209
We assume the conjugate prior NIG(a0 , b0 , γ0 , κ0 ), so that
a0 b0
ρ|σ 2 ∼ N(γ0 , σ 2 κ0 ), and σ 2 ∼ InvGamma( , ). (7.48)
2 2
We derive an iterative formula for the posterior by setting the prior equal to
the previous posterior: see Section 2.3.1. Set T = 1. Applying Theorem 6.5 we
obtain the posterior NIG(a1 , b1 , γ1 , κ1 ), i.e.,
a 1 b1
ρ|σ 2 , x1 ∼ N(γ1 , σ 2 κ1 ), and σ 2 |x1 ∼ InvGamma( , )
2 2
with
a 1 = a0 + 1
x20 −1
b1 = b0 + (x1 − γ0 x0 )2 (1 + )
κ0
γ0 (7.49)
γ1 = κ1 (x0 x1 + )
κ0
1 −1
κ1 = (x20 + ) .
κ0
Note that, for x0 = 0 and γ0 = 0 we have
a1 = a0 + 1, b1 = b0 + x21 , γ1 = 0, and κ1 = κ0 . (7.50)
Set T = 2. Applying Theorem 6.5 we obtain the posterior
θ|(x0 , x1 , x2 ) ∼ NIG(a2 , b2 , γ2 , κ2 )
with
a 2 = a1 + 1
x21 −1
b2 = b1 + (x2 − γ1 x1 )2 (1 + )
κ1
γ1 (7.51)
γ2 = κ2 (x1 x2 + )
κ1
1 −1
κ2 = (x21 + ) .
κ1
Note that, also for γ1 = 0 we have γ2 = 0 a.s. In general we have the iterative
formula for the posterior
θ|(x0 , . . . , xT ) ∼ NIG(aT , bT , γT , κT ) (7.52)
with
aT = aT −1 + 1
x2T −1 −1
bT = bT −1 + (xT − γT −1 xT −1 )2 (1 + )
κT −1
γT −1 (7.53)
γT = κT (xT −1 xT + )
κT −1
1
κT = (x2T −1 + )−1 .
κT −1
210 ESTIMATION
We are interested in the distribution of XT +1 given xT = (x0 , . . . , xT ). The
posterior θ|xT ∼ NIG(aT , bT , γT , κT ) and (7.46) imply

ρ|σ 2 , xT ∼ N(γT , σ 2 κT )
(7.54)
XT +1 |(ρ, σ 2 , xT ) ∼ N(ρxT , σ 2 ).

From Theorem 6.1, we get

XT +1 |(σ 2 , xT ) ∼ N(γT xT , σ 2 (κT x2T + 1)). (7.55)

Further, we have
a T bT
σ 2 |xT ∼ InvGamma(
, ).
2 2
Applying Lemma 6.5, we obtain the predictive distribution
bT
XT +1 |xT ∼ t1 (aT , γT xT , (κT x2T + 1)). (7.56)
aT
Prediction regions are of the form
aT
{x : (x − γT xT )2 (κT x2T + 1)−1 ≤ const}. (7.57)
bT
Recall that,

(X − μ)
for X ∼ t1 (ν, μ, σ 2 ), it holds ∼ tν , (7.58)
σ
where tν is students t-distribution with ν degrees of freedom. Setting tν,1− α2
for the (1 − α2 )–quantile, we obtain the prediction interval
 $ $ 
bT bT
γT xT − taT ,1− α2 2
(κT xT + 1), γT xT + taT ,1− α2 2
(κT xT + 1) .
aT aT
(7.59)
2

7.4.1 Prediction in Linear Models


We assume the linear model

y = Xβ + , ∼ Nn (0, σ 2 In ), θ = (β, σ 2 ),

with the posterior θ|y ∼ NIG(a1 , b1 , γ1 , Γ1 ), i.e.,

β|y, σ 2 ∼ Np (γ1 , σ 2 Γ1 )
(7.60)
σ 2 |y ∼ InvGamma(a1 /2, b1 /2).
PREDICTION 211
We are interested in the prediction of q future observations generated by

yf = Zβ + ε, ε ∼ Nq (0, σ 2 Σf ), (7.61)

where the q × p matrix Z is known, the parameter θ = (β, σ 2 ) is the same as


above. The error term ε is independent of . We have

yf |(β, σ 2 , y) ∼ Nq (Zβ, σ 2 Σf )
(7.62)
β|y, σ 2 ∼ Np (γ1 , σ 2 Γ1 ).

From Theorem 6.1, we obtain


 
yf |σ 2 , y ∼ Nq Zγ1 , σ 2 (ZΓ1 ZT + Σf ) . (7.63)

Further, we have
σ 2 |y ∼ InvGamma(a1 /2, b1 /2).
Applying Lemma 6.5, we obtain the predictive distribution
b1
yf |y ∼ tq (a1 , Zγ1 , (ZΓ1 ZT + Σf )). (7.64)
a1
The multivariate t-distribution tq (ν, μ, Σ) belongs to the elliptical class. The
level sets are ellipsoids around the location μ; the matrix Σ gives the shape of
the ellipsoid. Using (7.31) the prediction region is
a1
{z : (z − Zγ1 )T (ZΓ1 ZT + Σf )−1 (z − Zγ1 ) ≤ Fq,a1 (1 − α)} (7.65)
qb1

where Fq,a1 (1 − α) is the 1 − α quantile of Fq,a1 . For q = 1 we have

b1
yf |y ∼ t1 (a1 , yf , σf2 ), , with yf = Zγ1 , σf2 = (ZΓ1 ZT + Σf ) (7.66)
a1
and
y f − yf
∼ ta 1 (7.67)
σf
where ta1 is the t-distribution with a1 degrees of freedom. Let ta1 (1 − α) be
the quantile of ta1 , then we obtain the prediction interval:
α α
[yf − ta1 (1 − )σf , yf + ta1 (1 − )σf ] (7.68)
2 2

Example 7.20 (Prediction in quadratic regression)


We consider a polynomial regression relation

yi = xi + x2i + εi , i = 1, . . . , 21. (7.69)


212 ESTIMATION
A data set, using (7.69), is generated with 21 equidistant design points between
−1 and 1; εi ∼ N(0, 0.52 ) i.i.d. The generated data are

x −1 −0.9 −0.8 −0.7 −0.6 −0.5 −0.4


y 0.470 0.098 −0.680 0.144 −0.486 −1.407 0.439
x −0.3 −0.2 −0.1 0 0.1 0.2 0.3
y −0.632 −0.641 −0.644 −0.688 0.265 0.670 0.373
x 0.4 0.5 0.6 0.7 0.8 0.9 1
y −0.157 0.703 0.546 1.632 0.790 1.968 1.887

We consider the quadratic regression model with no intercept

yi = β1 xi + β2 x2i + εi , i = 1, . . . , 21, εi ∼ N(0, 0.52 ), i.i.d.

and set the prior NIG(a0 , b0 , γ0 , Γ0 ) with a0 = 10, b0 = 1, γ0 = (0, 0)T , and
Γ0 = 2I2 . The posterior is NIG(a1 , b1 , γ1 , Γ1 ) with a1 = 31, b1 = 7.12, γ1 =
(0.903, 0.869)T , and ⎛ ⎞
0.122 ≈0
Γ1 = ⎝ ⎠.
≈0 0.180
We set α = 0.1, then the quantile is ta1 ≈ 1.70. Applying (7.68) the prediction
interval is calculated at xf = 1.1 as [1.08, 3.01] and at xf = 1.4 as [1.84, 4.10].
Figure 7.11 illustrates the example. 2

7.5 List of Problems


1. Consider an i.i.d. sample X = (X1 , . . . , Xn ) from Gamma(ν, θ). The param-
eter of interest is the rate parameter θ. Assume θ ∼ Gamma(α, β).
(a) Derive the posterior.
(b) Determine the L2 estimator for θ.
(c) Determine the MAP estimator for θ.
(d) Give a procedure for determining the HPD α-credible interval.
2. Consider an i.i.d. sample X = (X1 , . . . , Xn ) from N(μ, σ 2 ). The parameter
of interest is θ = (μ, σ 2 ). Assume a conjugate prior.
(a) Determine the maximum likelihood estimators for θ.
(b) Are the maximum likelihood estimators of μ and σ 2 correlated?
Calculate Cov(μMLE , σMLE 2
|θ) .
(c) Determine the MAP estimator for θ.
(d) Are the MAP-estimators of μ and σ 2 correlated?
Calculate Cov(μMAP , σMAP 2
|θ) for θ = (0, 1).
LIST OF PROBLEMS 213

5

4


3

● ●
2



1

● ●

● ● ●


● ●
0



● ● ● ● ●
−1

−1.0 −0.5 0.0 0.5 1.0 1.5

Figure 7.11: Example 7.20.

n
3. Consider the simple linear regression model with i=1 xi = 0

yi = α + βxi + εi , i = 1, . . . , n, εi ∼ N(0, σ 2 ) i.i.d.

The unknown parameter is θ = (α, β) ∼ N2 (γ, σ 2 diag(λ1 , λ2 )), the error


variance is known.
(a) Derive the expression for an α0 -credible interval C(z) for μ(z) = α + βz
such that
P(μ(z) ∈ C(z)|y) ≥ 1 − α0
and the width of the band is as small as possible.
(b) Compare C(z) with the respective prediction interval at z.
4. Let X|θ ∼ Bin(n, θ) be observed and θ ∼ Beta(α0 , β0 ). The future data
point has the distribution Xf |θ ∼ Bin(nf , θ).
(a) Determine the predictive distribution π(xf |x).
(b) Set α0 = β0 = 1, n = 5, x = 3 and nf = 1. Calculate the predictive
distribution. (Use R.)
5. We are interested in two regression lines

yi = αy + βy xi + εi , i = 1, . . . , n, εi ∼ N(0, σ 2 ) i.i.d.
(7.70)
zi = αz + βz xi + ξi , i = 1, . . . , n, ξi ∼ N(0, σ 2 ) i.i.d.,
n n
where εi and ξi are mutually independent and i=1 xi = 0, i=1 x2i = n.
The unknown parameter is θ = (αy , βy , αz , βz , σ 2 ). We are mainly inter-
ested in η = 12 (βy + βz ).
214 ESTIMATION
(a) Re–write model (7.70) as univariate model with response variable y =
(y1 , . . . , yn , z1 , . . . , zn )T .
(b) Assume prior θ ∼ NIG(a, b, m, Γ) with Γ = diag(λ1 , λ2 , λ3 , λ4 ).
Derive the posterior distribution of η.
(c) Assume that σ 2 is known. Determine the HPD α-credible interval
C1 (y, σ 2 ) for η.
(d) Assume that σ 2 is unknown. Determine the Bayesian L2 estimate σ̃12 for
σ2 .
(e) Assume that σ 2 is unknown. Determine the HPD α-credible interval
C1 (y) for η.
n n
6. Consider model (7.70), with i=1 xi = 0 and i=1 x2i = n. Set

1 1
ui = (yi +zi ) = αu +βu xi + i , i ∼ N(0, σ 2 ), i.i.d., i = 1, . . . , n, (7.71)
2 2
where αu = 12 (αy + αz ), βu = 12 (βy + βz ) and i = 12 (εi + ξi ). The unknown
parameter is θu = (αu , βu , σ 2 ) and θu ∼ NIG(a, b, 0, diag(c1 , c2 )).
(a) Derive the posterior distribution of βu .
(b) Assume that σ 2 is known. Determine the HPD α-credible interval
C2 (u, σ 2 ) for βu .
(c) Assume that σ 2 is unknown. Determine the Bayesian L2 estimate σ̃22 for
σ2 .
(d) Assume that σ 2 is unknown. Determine the HPD α-credible interval
C2 (u) for βu .
7. Consider the two models (7.70) and (7.71). The parameter of interest is
η = 12 (βy +βz ) = βu . Compare the results in Problems 5 and 6, particularly:
(a) Specify prior distributions such that (η, σ 2 ) have the same prior in both
models.
(b) Assume that σ 2 is known. Compare the HPD α-credible intervals for η.
(c) Assume that σ 2 is unknown. Compare the Bayesian L2 estimates for σ 2 .
(d) Assume that σ 2 is unknown. Compare the HPD α-credible intervals for
η.
8. Related to Problem 10 in Chapter 6. Assume two correlated regression lines

yi = αy + βy xi + εi , i = 1, . . . , n, εi ∼ N(0, σ12 ) i.i.d.


(7.72)
zi = αz + βz xi + ξi , i = 1, . . . , n, ξi ∼ N(0, σ22 ) i.i.d.,
n n 2
where Cov(y, z) = σ12 with i=1 xi = 0, i=1 xi = n. The unknown
2 2
parameter is θ = (αy , βy , αz , βz , σ1 , σ12 , σ2 ). Assume the same conjugate
prior for θ as in Problem 10 of Chapter 6, θ ∼ NIW(ν, 0, diag(λ1 , λ2 ), I2 ).
We are mainly interested in η = 12 (βy + βz ).
(a) Determine the estimators of βy , βz and η.
LIST OF PROBLEMS 215
(b) Derive the posterior distribution of η given Σ. Give the HPD α-credible
interval C3 (Y, Σ) for η for known Σ. Setting σ1 = σ2 = σ known and
σ12 = 0, compare this HPD α-credible interval C3 (Y, Σ) with the inter-
val C1 (y, σ) in model (7.70). Hint:

If Z ∼ MN(M, U, W) then AZB ∼ MN(AMB, AUAT , BWBT ).


(7.73)
(c) Derive the posterior distribution of η. Hint: Use the result Gupta and
Nagar (2000, Theorem 4.38), which says that if

T ∼ tm,k (ν, M, U, V) then ATC ∼ ts,l (ν, AMC, AUAT , BVBT ).


(7.74)
(d) Determine the HPD α-credible interval for η.
Chapter 8

Testing and Model Comparison

This chapter deals with the Bayesian approach for hypotheses testing. The
hypotheses are two alternative Bayes models. The goal is to figure out which
of the models can be the right one for the observed data. We set
H0 : M0 = {P0 , π0 } versus H1 : M1 = {P1 , π1 } (8.1)
where
Pj = {Pj,θj : θj ∈ Θj }, θj ∼ πj , j = 0, 1. (8.2)
It is possible that the statistical models are different including the parameter
spaces. Note that, θj is not a component of the parameter θ, rather another
parameter. We give an example for two different Bayes models.

Example 8.1 (Two alternative models) In a medical study, 1000 pa-


tients are randomly chosen and their blood samples are tested for multiple
drug resistance. 15 persons got a positive test result. From earlier studies it
is known that the risk of infection is around 2%. Let X be the number of
infected patients. Two different Bayes models are proposed:
M0 : X ∼ Bin(1000, p), p ∼ Beta(2, 100)
M1 : X ∼ Poi(λ), λ ∼ Gamma(20, 1)
2

In this chapter we take up three different approaches:


- The goal is to decide after experiment which model is the right one. Here
we suppose that both models, M0 and M1 , are generated by splitting of
the common model M = {P, π} with P = {Pθ , θ ∈ Θ}. We apply decision
theoretic results derived in Chapter 4.
- The goal is to obtain evidence against the null hypothesis. We introduce
a model indicator k ∈ {0, 1} and embed both models M0 and M1 in
a common model. The Bayesian principle recommends the model with a
higher posterior probability of its model indicator. The main tool is the
Bayes factor.

DOI: 10.1201/9781003221623-8 216


BAYES RULE 217
- The goal is to compare models by empirical methods which includes model
fit and model complexity. We present the Bayesian information criterion
(BIC) and the deviance information criterion (DIC).

8.1 Bayes Rule


In this section we treat the decision theoretic approach. We consider a more
specific test problem. The Bayes model M = {P, π} is split as follows. Set
P = {Pθ : θ ∈ Θ} and Θ = Θ0 ∪ Θ1 with Θ0 ∩ Θ1 = ∅, and define

Pj = {Pθ : θ ∈ Θj } , j = 0, 1.

We assume that the prior probability, Pπ (Θ0 ) > 0. The prior π on Θ is de-
composed as
π(θ) = Pπ (Θ0 )π0 (θ) + Pπ (Θ1 )π1 (θ)
with Pπ (Θ0 ) + Pπ (Θ1 ) = 1, and
⎧ ⎧
⎨ π(θ) for θ ∈ Θ0 ⎨ π(θ)
for θ ∈ Θ1
Pπ (Θ0 ) Pπ (Θ1 )
π0 (θ) = π1 (θ) = .
⎩ 0 else ⎩ 0 else

The test problem (8.1) is:

H0 : M0 = {P0 , π0 } versus H1 : M1 = {P1 , π1 }

The model Mj is true, if the data generating function is element of Pj for


j = 0, 1. Simplified, the test problem usually is written as

H0 : Θ0 versus H1 : Θ1 . (8.3)

We introduce a test as the decision rule depending on x ∈ X .

Definition 8.1 (Test) A test ϕ is a statistic from the sample space X to


{0, 1}: ⎧
⎨ 1 if x ∈ C (reject H )
1 0
ϕ(x) =
⎩ 0 if x ∈ C (do not reject H )
0 0

where X = C1 ∪ C0 , with C1 ∩ C0 = ∅.

In Chapter 4 on decision theory we already introduced the test (4.29) as Bayes


rule. We can make two different types of error: the error of type I is that M0
is true but we decide for M1 . This error gets the loss value a0 . The opposite
error is the error of type II, that we wrongly decide for M0 , gets the loss
value a1 . The Bayes rule related to this asymmetric loss function can be de-
rived analogous to Theorem 4.5. Here we give the optimal Bayes test.
218 TESTING AND MODEL COMPARISON
Set P (.|x) for the posterior distribution in the common model {{Pθ : θ ∈ Θ}, π}.
π

Then Bayes test is



⎨ 1 if Pπ (Θ |x) < a1
0 a0 +a1
ϕ(x) = . (8.4)
⎩ 0 if Pπ (Θ |x) ≥ a1
0 a0 +a1

Note that Pπ (Θ0 |x) + Pπ (Θ1 |x) = 1. For a0 = a1 the model, whose parameter
space has higher posterior probability, is accepted. This procedure has a nice
heuristic background. Following the Bayesian inference principle, we apply
the ratio of posterior probabilities instead of likelihood ratio. But note that,
it does not give the same answer as in the Neyman–Pearson theory, where first
the probability of type I error is bounded and then the probability of type II
error is minimized. The test in (8.4) treats both errors simultaneously, only
corrected by different weights.

Example 8.2 (Binomial distribution and beta prior)


We consider the Bayes model in Example 2.11, on page 16, given by X|θ ∼
Bin(n, θ) and θ ∼ Beta(α0 , β0 ) with α0 > 1 and β0 > 1. Then θ|x ∼ Beta(α0 +
x, β0 + n − x). We are interested in the testing problem:
1 1
H0 : θ ≥ versus H1 : θ < .
2 2
Both types of errors should have the same weight; we set a0 = a1 . Recall
the properties of Beta(α, β). For α = β the distribution is symmetric around
0.5, for 1 < α < β the distribution is unimodal and negatively skewed. This
implies, for α0 + x < β0 + n − x, we reject the null hypothesis. We obtain the
test ⎧
⎨ 1 if x < 1 (β − α + n)
2 0 0
ϕ(x) = (8.5)
⎩ 0 otherwise

We continue with binomial data.

Example 8.3 (Sex ratio at birth)


The sex ratio at birth (SRB) is defined as male births per female births.
The WHO (World Health Organization) determines the expected SRB by
106 boys per 100 girls. The number of boys has Bin(n, p) distribution with
success probability p = 106/206 = 0.5145631. The Official Statistics of Sweden
counted in 2021 x = 58485 male births and n−x = 55778 female births, which
gives an actual sex rate at birth as 1.04855. For uniform prior, α0 = 1 and
β0 = 1, we obtain the posterior Beta(58486, 55779). We consider the testing
problem
H0 : p ≥ 0.5145631 versus H1 : p < 0.5145631.
BAYES RULE 219

Figure 8.1: Example 8.3. The white storks decide the SRB.

The posterior probability of the null hypothesis is Pπ (Θ0 |x) = 1−0.9669733 =


0.03302665. We conclude that the Swedish sex ratio at birth in 2021 was sig-
nificant less than the value of WHO. This clear conclusion is possible because,
due to high number of births, the posterior variance is very small; see Figure
8.4. 2

Example 8.4 (Normal distribution and normal prior)


Consider X ∼ N(μ, σ 2 ) with known variance σ 2 ; θ = μ with normal prior
θ ∼ N(μ0 , σ02 ). In Example 2.12, on page 16 with n = 1 we derived the
posterior distribution

xσ02 + μ0 σ 2 σ02 σ 2
N(μ1 , σ12 ), with μ1 = and σ 2
1 = . (8.6)
σ02 + σ 2 σ02 + σ 2

We are interested in the testing problem:

H0 : θ ≤ 0 versus H1 : θ > 0.

Calculate
   
θ − μ1 −μ1 −μ1
Pπ (θ ≤ 0|x) = Pπ ≤ |x =Φ
σ1 σ1 σ1

where Φ is the distribution function of N(0, 1). Setting za for the quantile
a1
Φ(za ) = ,
a0 + a1
220 TESTING AND MODEL COMPARISON

0.20
0.20

null hypothesis ● null hypothesis


alternative

0.15
0.15

alternative

posterior
posterior

0.10
0.10

0.05
0.05

0.00
0.00

] ●

−5 0 5 10 −5 0 5 10

theta theta

Figure 8.2: Left: Example 8.4. The null hypothesis is rejected. Right: Illustration for
point-null hypothesis. The null hypothesis is always rejected for any experiment’s
result.

we obtain the test


⎧ 
⎨ 1 if xσ 2 + μ σ 2 > −z σσ σ 2 + σ 2
0 0 a 0 0
ϕ(x) = (8.7)
⎩ 0 otherwise.

For a0 = a1 the test is illustrated in Figure 8.2.


2

We continue with analysis of the life length data set.

Example 8.5 (Life length vs. life line)


We analyse the data presented in Example 6.3, on page 130. In particular,
we want to know if the length of life line is an indicator of life length. This is
formulated with respect to the slope β of the linear regression line as

H0 : β ≥ 0 versus H1 : β < 0.

In Example 7.17 we obtained that the posterior of intercept α and slope β


given the variance σ 2 is the two dimensional normal distribution, N2 (γ1 , σ 2 Γ1 ),
so that the marginal posterior of β given σ 2 is N(−2.27, σ 2 0.016). Fur-
ther, σ 2 |y ∼ InvGamma(a1 /2, b1 /2) with a1 = 58 and b1 = 7554. Applying
Lemma 6.5 we obtain β|y ∼ t1 (58, −2.27, b1 /a1 0.016). Thus, in this model,
Pπ (Θ0 |y) = 0.06 and we reject the popular belief. See Figure 8.3. 2

8.2 Bayes Factor


In this section we consider the approach which formulates evidence against
the null hypothesis. Assume the general test problem with two alternative
BAYES FACTOR 221

0.25
0.20
0.15
0.10
0.05

0.94

0.06
0.00

−8 −6 −4 −2 0 2 4

slope

Figure 8.3: Example 8.5. Bayes model with θ = (α, β, σ 2 ). The null hypothesis is
clearly rejected.

Bayes models,
H0 : M0 = {P0 , π0 } versus H1 : M1 = {P1 , π1 } (8.8)
with Pj = {Pj,θj : θj ∈ Θj }.
We start with the construction of a common Bayes model Mcom which embed
both models. We introduce an additional parameter, the model indicator and
a prior on it:
k ∈ {0, 1}, π(k) = pk , p0 + p1 = 1, p0 > 0. (8.9)
The parameter space of the common model Pcom is
Θcom = {0} × Θ0 ∪ {1} × Θ1
with parameter θ = (k, θk ) ∈ Θcom , such that
Pcom = {P0 ∪ P1 : θ ∈ Θcom }.
The common prior is the mixture
π(θ) = π((k, θk ))
= π((k, θk )|k = 0)π(k = 0) + π((k, θk )|k = 1)π(k = 1) (8.10)
= π0 (θ0 )p0 + π1 (θ1 )p1 .
We get the common Bayes model
Mcom = {Pcom , π}. (8.11)
222 TESTING AND MODEL COMPARISON
The parameter of interest is the model indicator k. Following the Bayesian
inference principle, the posterior distribution p(k|x) is the main source of
information. Applying Bayes theorem we obtain
p(k)p(x|k)
p(k|x) = , (8.12)
p(0)p(x|0) + p(1)p(x|1)
where p(x|k) is the marginal distribution of the data x ∈ X given the model
Mk . To compare the models, the ratio is of main interest is
p(0|x) p(x|0) p(0)
= . (8.13)
p(1|x) p(x|1) p(1)
We have

p(x|k) = p(x, θk |k) dθk
Θk

= pk (x, θk ) dθk (8.14)
Θk

= k (θk |x)πk (θk ) dθk
Θk

We get 
p(0|x) 0 (θ0 |x)π0 (θ0 ) dθ0 p(0)
= Θ0 . (8.15)
p(1|x)  (θ |x)π1 (θ1 ) dθ1 p(1)
Θ1 1 1
The information from the data is included in the Bayes factor, defined as
follows.

Definition 8.2 (Bayes factor) Assume the test problem (8.8). The
Bayes factor is 
0 (θ0 |x)π0 (θ0 ) dθ0
π 
B01 = B01 = Θ0 .
 (θ |x)π1 (θ1 ) dθ1
Θ1 1 1

Applying (8.15) we get the relation between the prior odds ratio,
p(0) p(0)
=
p(1) 1 − p(0)
and the posterior odds ratio,
p(0|x) p(0|x)
=
p(1|x) 1 − p(0|x)
with the help of the Bayes factor:
p(0|x) p(0)
= B01
1 − p(0|x) 1 − p(0)
i.e.,
BAYES FACTOR 223

posterior odds = Bayes factor × prior odds

The Bayes factor measures the change in the odds. Small Bayes factor B01
means evidence against the null hypothesis. There are different empirical scales
for judging the size of Bayes factor. We quote here a table from Kass and
Raftery (1997). Note that, the table is given for B10 = (B01 )−1 .

2 ln(B10 ) B10 Evidence against H0


0 to 2 1 to 3 Not worth more than a bare mention
2 to 6 3 to 20 Positive
6 to 10 20 to 150 Strong
> 10 > 150 Very strong

Example 8.6 (Two alternative models)


A
We continue with Example 8.1 and calculate the Bayes factor B01 = B. The
prior in model M0 is Beta(α0 , β0 ). It holds that

A := 0 (θ0 |x)π0 (θ0 ) dθ0
Θ0
 1
n x 1
= θ (1 − θ)n−x θα0 −1 (1 − θ)β0 −1 dθ
0 x B(α 0 , β 0 )
   1
n 1
= θα0 +x−1 (1 − θ)β0 +n−x−1 dθ
x B(α0 , β0 ) 0
 
n B(α0 + x, β0 + n − x)
= .
x B(α0 , β0 )

Using  
n 1 1
= ,
x n + 1 B(x + 1, n − x + 1)
we obtain
1 B(α0 + x, β0 + n − x)
A= .
n + 1 B(x + 1, n − x + 1)B(α0 , β0 )
224 TESTING AND MODEL COMPARISON
The prior in model M1 is Gamma(α1 , β1 ). It holds that

B := 1 (θ1 |x)π1 (θ1 ) dθ1
Θ1
 ∞
λx β α1
= exp(−λ) 1 λα1 −1 exp(−β1 λ) dλ
0 x! Γ(α1 )
 ∞
β1α1
= λx+α1 −1 exp(−λ(β1 + 1)) dλ.
x!Γ(α1 ) 0
Using  ∞
Γ(x + α1 )
λx+α1 −1 exp(−λ(β1 + 1)) dλ = ,
0 (β1 + 1)α1 +x
we obtain
β1α1 Γ(x + α1 )
B= .
x!Γ(α1 ) (β1 + 1)α1 +x
In general the Bayes factor A/B has a difficult expression which divides large
values. In case α0 = β0 = 1, we can use B(1, 1) = 1 and can simplify A to
1/(n + 1). Setting also α1 = β1 = 1 in B, we obtain
A 1
B01 = = 2x+1 .
B n+1
2
In the special case where we formulate the alternatives by splitting the com-
mon model {P, π}, as in Section 8.1 we can re-formulate the Bayes factor as
follows. The prior probability pk of the submodel Mk coincides with Pπ (Θk ),
k = 0, 1 and
 
1
k (θk |x)πk (θk ) dθk = π (θ|x)π(θ) dθ
Θk P (Θk ) Θk

m(x)
= π π(θ|x) dθ (8.16)
P (Θk ) Θk
m(x) π
= π P (Θk |x)
P (Θk )

where m(x) = Θ (θ|x)π(θ) dθ. We obtain, for the test problem (8.3), the
Bayes factor as
Pπ (Θ0 |x) Pπ (Θ1 )
Bπ01 = π .
P (Θ1 |x) Pπ (Θ0 )
Note that, Pπ and Pπ (.|x) are the prior and posterior probabilities related to
the common model {P, π}. In this case (8.4) is the optimal decision rule. Both
approaches, decision making or evidence finding, can be transformed to each
other. We have the relation:
a1
Reject H0 ⇔ Pπ (Θ0 |x) <
a0 + a1
a1 p1
⇔ B01 <
π
.
a 0 p0
BAYES FACTOR 225
8.2.1 Point Null Hypothesis
Given {P, π} with P = {Pθ : θ ∈ Θ}, a point-null hypothesis is defined by
splitting Θ = Θ1 ∪ {θ0 }, where Θ1 = Θ \ {θ0 }, written as

H0 : θ = θ0 versus H1 : θ = θ0 .

Under H0 we have P0 = {Pθ0 } and under the alternative P1 = {Pθ : θ ∈ Θ1 }.


The Bayes test problem of a point-null hypothesis corresponds to the compar-
ison of the Bayes models

H0 : M0 = {P0 , π0 } versus H1 : M1 = {P1 , π1 }

with two different priors. The prior of the point-null model is the Dirac mea-
sure on θ0 , ⎧
⎨ 1 if θ = θ
0
π0 (θ) =
⎩ 0 otherwise.

The prior π1 of the alternative is defined on Θ1 ; for continuous priors it is


defined on Θ as well by π1 (θ) = π(θ). Note that, the priors π0 and π1 are
not generated by a split of a common prior π as in (8.10). A formal applica-
tion of the test (8.4) gives a useless result, because for continuous posteriors,
Pπ (Θ0 |x) = 0 and the null hypothesis is never accepted; see Figure 8.2.
The way out is the concept of Bayes factor and using the model indicator
k = 0, 1. We set the prior probability p0 on k = 0. Then the common prior,
mixing π0 and π1 , is

πcom (θ) = p0 π0 (θ) + (1 − p0 )π1 (θ). (8.17)

This prior distribution is called zero-inflated. It has a distribution function


with a jump of height p0 at θ0 . For continuous prior π1 we have
 
p(x|θ)π1 (θ) dθ = p(x|θ)π(θ) dθ = m(x).
Θ1 Θ

Further 
p(x|θ)π0 (θ) dθ = p(x|θ0 ),
Θ0

hence
p(x|θ0 )
B01 = . (8.18)
m(x)

Example 8.7 (Binomial distribution and beta prior)


We continue with Example 8.2. Now we are interested in the two-sided testing
problem:
1 1
H0 : θ = versus H1 : θ = .
2 2
226 TESTING AND MODEL COMPARISON
1
The prior on H0 is given by ρ0 = 2. The prior under the alternative is U(0, 1).
We have    n
n 1
p(x|θ0 ) =
x 2
and     
1
n x n
m(x) = θ (1 − θ)n−x dθ = B(x + 1, n − x + 1).
0 x x
Using  
n 1
= ,
k (n + 1)B(k + 1, n − k + 1)
1
we obtain m(x) = n+1 and
   n
n 1
B01 = (n + 1) .
x 2
2

In Example 8.3 we studied the sex ratio at birth in Sweden and concluded
that the rate is less than the expected rate given by WHO. Now we want to
check the related one–point hypothesis.

Example 8.8 (Sex ratio at birth)


We continue with Example 8.3 and set θ0 = 0.5145631. We are interested in
the testing problem
H0 : θ0 versus H1 : θ = θ0 .
We take a uniform prior with α0 = 1 and β0 = 1 and set p0 = 12 . It holds that
 
n x
p(x|θ0 ) = θ (1 − θ0 )n−x ≈ 0.000436.
x 0

Thus the Bayes factor

B01 = p(x|θ0 )(n + 1) = 114264 × 0.000436 ≈ 49.82, B10 = 0.02

Applying the table above we have no evidence against H0 . 2

In likelihood testing theory we can test a one–point hypothesis with the help of
a confidence region. The same method is possible in Bayes inference applying
HPD credible regions. Assume that

θ0 ∈ Cx = {θ : π(θ|x) ≥ k}, Pπ (π(θ|x) ≥ k|x) = 1 − α(k),

then
p(x|θ0 )π(θ0 )
π(θ0 |x) = >k
m(x)
BAYES FACTOR 227
250

250
200

200
posterior

posterior
150

150
100

100
0.967
50

50
● ●
0

0
0.505 0.510 0.515 0.520 0.505 0.510 0.515 0.520

theta theta

Figure 8.4: Sex ratio at birth (SRB) data. Left: Example 8.3: The one-sided test
rejects the null hypothesis. The SRB in Sweden is significantly less than the value
of WHO. Right: Example 8.9: The WHO value lies inside of credible region. There
is no evidence against the one–point null hypothesis.

and
p(x|θ0 ) k
B01 = > .
m(x) π(θ0 )
Small B01 delivers evidence against H0 , but here we have a lower bound.
Roughly speaking, if θ0 ∈ Cx we will find no evidence against H0 .
We illustrate it with the Swedish SRB data.

Example 8.9 (Sex ratio at birth)


We continue with the example above. The posterior distribution is concen-
trated on a small interval between 0.505 and 0.520. The HPD α-credible
interval for α = 0.05 is [0.5089, 0.5147]. It includes the theoretical value
θ0 = 0.5145. We get no evidence against H0 ; see Figure 8.4. 2

8.2.2 Bayes Factor in Linear Model


We consider two alternative linear models, which differ in the regression part,
while the error distributions are the same:

H0 : M0 = {P0 , π0 } versus H1 : M1 = {P1 , π1 } (8.19)


where, for j = 0, 1,
% &
Pj = Nn (X(j) βj , σ 2 Σ) : θj = (βj , σ 2 ), βj ∈ Rpj , σ 2 ∈ R+

and
πj (θj ) = πj (βj |σ 2 )π(σ 2 ).
This set up can be illustrated as follows.
228 TESTING AND MODEL COMPARISON
Example 8.10 (Comparison of regression curves) We fit two different
models to the same data and want to figure out which model is better. Con-
sider two polynomial regression curves of different degree but same error as-
sumptions. Model M0 is a cubic regression:

P0 : yi = β0,0 + β1,0 xi + β2,0 x2i + β3,0 x3i + εi , εi ∼ N(0, σ 2 ), i.i.d.

Model M1 is a quadratic regression:

P1 : yi = β0,1 + β1,1 xi + β2,1 x2i + εi , εi ∼ N(0, σ 2 ), i.i.d.

To simplify the calculations let us consider an arbitrary linear model


%% & &
Nn (Xβ, σ 2 Σ) : θ = (β, σ 2 ), β ∈ Rp , σ 2 ∈ R+ , , π

with posterior NIG(a, b, γ, Γ). Using the known posterior we calculate the in-
tegral 
m(y) = (θ|y)π(θ)dθ.
Θ
Set
(θ|y)π(θ) = c0 cπ ker(θ|y)
where cπ is the constant from the prior and
 n
1 1
c0 = √ .
2π |Σ|1/2

Further, for
π(θ|y) = c1 ker(θ|y)
with the same kernel function ker(θ|y) and
 p   a2
1 1 b 1
c1 = √
2π |Γ|1/2 2 Γ( a2 )

we obtain
c0 cπ
m(y) = .
c1
Note that, the constant c0 is the same in both models. Summarizing, we obtain
the Bayes factor
(0) (1)
m(0) (y) cπ c
B01 = (1) = (1) 1(0) ,
m (y) cπ c1
BAYES FACTOR 229
where the superscript indicates the corresponding model. We consider the
conjugate prior and Jeffreys prior. For conjugate priors, we assume
(j) (j)
βj |σ 2 ∼ Npj (γ0 , σ 2 Γ0 ), σ 2 ∼ InvGamma(a0 /2, b0 /2).

The constants of variance prior are the same in both models. We get
1
(0)
cπ √ |Γ0 |
(1) 2

(1)
= ( 2π)p1 −p0 (0)
.
cπ |Γ0 |

Now we calculate the ratio of the constants of NIG(a(0) , b(0) , γ (0) , Γ(0) ) and
NIG(a(1) , b(1) , γ (1) , Γ(1) ). Note that, a(0) = a(1) = a0 + n, where all other
parameters differ. We get
  12   a02+n
(1)
c1 √ p0 −p1 |Γ(0) | b(1)
= ( 2π) .
(0)
c1 |Γ(1) | b(0)

Summarizing, we have
1
(1)
1
  a02+n
|Γ(0) | |Γ0 |
2 2
b(1)
B01 = . (8.20)
(0)
|Γ0 | |Γ(1) | b(0)

Applying (6.67) we have


b(j) = b0 + Ridge(j)
with
Ridge(j) = (y − y(j) )T Σ−1 (y − y(j) ) + pen(j)
where

y(j) = X(j) γ (j) , pen(j) = (γ0 − γ (j) )T (Γ0 )−1 (γ0 − γ (j) ).
(j) (j) (j)

We assume, there exist positive matrices C(j) , j = 0, 1 such that


1 T −1
X Σ X(j) → C(j) , j = 0, 1. (8.21)
n (j)
From (6.64) we get
 
|(Γ0 )−1 + XT(1) Σ−1 X(1) |
(1)
|Γ(0) | |C(1) |
= = n(p1 −p0 ) + o(1) .
|Γ(1) | (0) −1 −1
|(Γ0 ) + X(0) Σ X(0) |
T |C(0) |

We transform the Bayes factor and take only the leading terms for n → ∞
into account, such that

2 ln(B01 ) ≈ (a0 + n) ln(Ridge(1) + b0 ) − ln(Ridge(0) + b0 ) + (p1 − p0 ) ln(n).


230 TESTING AND MODEL COMPARISON
Hence, we prefer the model with the smaller Ridge criterion, penalized by the
number of parameters.
Now, for Jeffreys prior we assume the same prior in both models:
1
π(θj ) ∝ .
σ2
The Bayes factor is calculated by the ratio of the posterior constants. Under
Jeffreys prior all posterior parameters are different. We obtain
  12 (1) n−p1
(1)
c1 √ (p0 −p1 ) |Γ(0) | ( b2 ) 2 Γ( n−p
2 )
0

B01 = = π .
(0)
c1 |Γ(1) | (0)
( b2 )
n−p0
2 Γ( n−p
2 )
1

Applying (6.85) and setting (y − y(j) )T Σ−1 (y − y(j) ) = RSS(j) we have


  n−p j
n−pj n−pj 1 2
(b(j)
) 2 =n 2 RSS(j) .
n
Using the approximation,

Γ(x + α) ≈ Γ(x)xα , for x → ∞,

we get
p1 −p0
Γ( n−p
2 )
0
n 2
n−p1 ≈ .
Γ( 2 ) 2
Under (8.21) we have
 
|Γ(0) | |XT(1) Σ−1 X(1) | |C(1) |
= = n(p1 −p0 ) + o(1) .
|Γ(1) | |XT(0) Σ−1 X(0) | |C(0) |

We transform the Bayes factor and take only the leading terms for n → ∞
into account, so that

2 ln(B01 ) ≈ n(ln(RSS(1) /n) − ln(RSS(0) /n)) + (p1 − p0 ) ln(n).

Hence we prefer the model with the smaller BIC (Bayes information criterion,
or Schwarz criterion; Schwarz (1978)). In the linear normal model, BIC is
given as (see Subsection 8.3.1)
RSS
BIC = n ln( ) + p ln(n).
n

8.2.3 Improper Prior


We can also apply improper priors, given that posterior is proper. A surprising
fact, however is that the test based on an improper prior cannot be approx-
imated by tests based on priors with increasing variances. This is known as
BAYES FACTOR 231
Jeffreys–Lindley paradox. We illustrate one of the complications by the fol-
lowing example.

Example 8.11 (Improper prior)


Let a single observation x from X ∼ N(θ, 1) and consider Jeffreys prior π(θ) ∝
1. The test problem is

H0 : θ = 0 versus H1 : θ = 0.

The Bayes factor for point-null hypothesis given in (8.18) is

p(x|θ = 0)
B01 =
m(x)

with  ∞  ∞
1 (x − θ)2
m(x) = p(x|θ) dθ = √ exp(− ) dθ = 1
−∞ −∞ 2π 2
and
1 x2 1
p(x|θ = 0) = √ exp(− ) ≤ √ .
2π 2 2π
Hence
1
B01 ≤ √ ≈ 0.4.

This procedure favours rejecting H0 . 2

The pseudo Bayes factor delivers a possible way out from the problems re-
garding improper priors. The idea is to explore the iterative structure of Bayes
methods. The sample is split into a training sample, sufficiently large to derive
a proper posterior, and this posterior is set as new proper prior. The training
sample should be chosen as small as possible. Following example demonstrates
the applicability of this idea.

Example 8.12 (Training sample)


Consider two observations (x1 , x2 ), i.i.d, from X ∼ N(μ, σ 2 ). Set θ = (μ, σ 2 )
and assume Jeffreys prior π(θ) ∝ σ12 . The posterior is proper, if the integral
m(x) exists. Using (x1 − μ)2 + (x2 − μ)2 = s2 + 2(x̄ − μ)2 , we have

m(x) = (θ|x)π(θ) dθ
  
1 (x1 − μ)2 + (x2 − μ)2 1
∝ exp − dθ
σ2 2σ 2 σ2
 ∞   ∞  
1 s2 (x̄ − μ)2 1
∝ exp − exp − dθ.
0 σ3 2σ 2 −∞ σ2 σ
232 TESTING AND MODEL COMPARISON
Applying   

(x̄ − μ)2 1 √
exp − dμ = 2π,
−∞ σ2 σ
and     12
 ∞
1 s2 2 1
exp − dσ 2
= Γ( ),
0 σ3 2σ 2 s2 2
we obtain m(x) < ∞. The sample size n = 2 implies a proper posterior. 2

8.3 Bayes Information


The goal of information criteria is to summarize the quality of a model in
one number. When comparing models, the model with the smaller number is
preferred. The common structure of information criteria is that the quality of
the model fit is penalized by the complexity of the model. The main idea is
to support less complex models with a sufficient good model fit.

8.3.1 Bayesian Information Criterion (BIC)


Assume a Bayes model {P, π} with P = {Pθ : θ ∈ Θ ∈ Rp } and
non-informative prior π. Given the data x, the model fit is measured us-
ing the maximum of the log-likelihood function and the model complexity
is the weighted dimension of the parameter space. Schwarz (1978) introduced
the Bayes information criterion, also known as Schwarz information criterion,
as follows.

Definition 8.3 (Bayes information criterion) The Bayes information


criterion BIC = BIC(x) is defined as

BIC = −2 max ln((θ|x)) + p ln(n) + const.


θ∈Θ

The term const is the same for two competing models, such that it does not
change the result. We calculate BIC for the normal linear model with Jeffreys
prior
%% & & 1
Nn (Xβ, σ 2 Σ) : θ = (β, σ 2 ), β ∈ Rp , σ 2 ∈ R+ , πJeff ; πJeff ∝ 2 .
σ
BAYES INFORMATION 233
Recall that, the data are now y and X denotes the design matrix. The log-
likelihood l(θ|y) = ln (θ|y) is
n 1 1
l(θ|y) = − ln(2πσ 2 ) − ln(|Σ|) − 2 (y − Xβ)T Σ−1 (y − Xβ). (8.22)
2 2 2σ
Its maximum is attained at

βΣ = (XT Σ−1 X)−1 XT Σ−1 y,

and
1
σ2 = RSS, RSS = (y − XβΣ )T Σ−1 (y − XβΣ ).
n
Thus
1
−2 max l(θ|x) = n ln( RSS) + n ln(2π) + ln(|Σ|) + n
θ∈Θ n
and with const = −n − ln(|Σ| − n ln(2π)) we obtain
1
BIC = n ln( RSS) + p ln(n). (8.23)
n

8.3.2 Deviance Information Criterion (DIC)


This popular criterion was introduced by Spiegelhalter et al. (2002). Assume
a Bayes model {P, π} with P = {Pθ : θ ∈ Θ ∈ Rp }. The deviance is defined
by the log-likelihood l(θ|x) = ln (θ|x) as

D(θ) = −2l(θ|x) + const (8.24)

where the constant term is independent of the model and has no influence to
model choice. The model fit is measured by the posterior expectation of the
deviance. The complexity of the model is given by

pD = E(D(θ)|x) − D(E(θ|x)). (8.25)

Definition 8.4 (Deviance information criterion) The Deviance in-


formation criterion DIC = DIC(x) is defined as

DIC = E(D(θ)|x) + pD .

We calculate DIC for the normal linear model

y = Xβ + ε, ε ∼ Nn (0, σ 2 Σ).
234 TESTING AND MODEL COMPARISON

Theorem 8.1
Assume the Bayes model
%% & &
Nn (Xβ, σ 2 Σ) : θ = (β, σ 2 ), β ∈ Rp , σ 2 ∈ R+ , π

with a prior π such that

θ|y ∼ NIG(a, b, γ, Γ).

Then
b a a+2  
DIC = n ln( )−2nΨ( )+ FIT+2 tr Γ XT Σ−1 X +const, (8.26)
a−2 2 b
where Ψ(.) is the digamma function and

FIT = (y − Xγ)T Σ−1 (y − Xγ). (8.27)

Proof: The log-likelihood is given in (8.22), so that


1
D(θ) = n ln(σ 2 ) + (y − Xβ)T Σ−1 (y − Xβ) + const
σ2
with const = nln(2π) + ln(|Σ|). In the following we apply the posterior mo-
ments
b
E(β|y) = γ, E(σ 2 |y) = , Cov(β|y, σ 2 ) = σ 2 Γ
a−2
and
a b a
E(σ −2 |y) = , E(ln(σ 2 )|y) = ln( ) − Ψ( ). (8.28)
b 2 2
We decompose the quadratic form

(y − Xβ)T Σ−1 (y − Xβ) =(y − Xγ)T Σ−1 (y − Xγ)


+ 2(y − Xγ)T Σ−1 X(γ − β)
+ (γ − β)T XT Σ−1 X(γ − β)

and obtain

E((y − Xβ)T Σ−1 (y − Xβ)|y, σ 2 ) = FIT + E((γ − β)T XT Σ−1 X(γ − β)|y, σ 2 ).

Further
 
E((γ − β)T XT Σ−1 X(γ − β)|y, σ 2 ) =E(tr (γ − β)(γ − β)T XT Σ−1 X|y, σ 2 )
 
=tr Cov(β|y, σ 2 )XT Σ−1 X
 
=σ 2 tr Γ XT Σ−1 X .
BAYES INFORMATION 235
It follows that
1
E(D(θ)|y) = nE(ln(σ 2 )|y) + E( (y − Xβ)T Σ−1 (y − Xβ)|y) + const
σ2
b a a  
= n(ln( ) − Ψ( )) + FIT + tr Γ XT Σ−1 X + const.
2 2 b
(8.29)

Further
2 a 2  
pD = n(ln( ) − Ψ( )) + FIT + tr Γ XT Σ−1 X . (8.30)
a−2 2 b
Summarizing, we obtain the statement.
2
In the following we calculate the difference of DIC between two Bayes linear
models
ΔDIC = DIC(0) − DIC(1) .
We assume the same set up as in Subsection 8.2.2, for comparing models with
different linear regression functions but the same error distribution.

M0 : y = X(0) β0 + ε, M1 : y = X(1) β1 + ε, (8.31)

with ε ∼ Nn (0, σ 2 Σ) and β0 ∈ Rp0 , β1 ∈ Rp1 . We consider conjugate priors


and Jeffreys prior. First we assume conjugate priors which fulfill πj (θj ) =
πj (βj |σ 2 )π(σ 2 ), j = 0, 1. Thus
(j) (j) (j) (j) (j)
θj ∼ NIG(a0 , b0 , γ0 , Γ0 ), θj |y ∼ NIG(a1 , b1 , γ1 , Γ1 ), j = 0, 1.

The expressions for the posteriors are given in Theorem 6.5. Note that the
parameters a0 and a1 are the same in both models. Applying Theorem 8.1 for
j = 0, 1 we calculate
ΔDIC = D(0) − D(1)
where
a1 + 2
FIT(j) + 2tr Γ1 XT(j) Σ−1 X(j) .
(j) (j)
D(j) = n ln(b1 ) + (j)
(8.32)
b1

Especially

b1 = b0 + FIT(j) + (γ0 − γ1 )T (Γ0 )−1 (γ0 − γ1 ), a1 = a0 + n


(j) (j) (j) (j) (j) (j)

and
Γ1 = ((Γ0 )−1 + XT(j) Σ−1 X(j) )−1
(j) (j)

with
XT(j) Σ−1 y + (Γ0 )−1 γ0
(j) (j) (j) (j)
γ1 = Γ1 .
236 TESTING AND MODEL COMPARISON
Under (8.21) we have
 −1
1 (j) −1 1 T −1
= C−1
(j)
nΓ1 = (Γ ) + X(j) Σ X(j) (j) + o(1)
n 0 n

and
 
(j) 1
tr Γ1 XT(j) Σ−1 X(j) = tr nΓ1 XT(j) Σ−1 X(j) = pj + o(1).
(j)
n

Secondly we assume Jeffreys prior


  pj2+2
1
πj (θj ) ∝ .
σ2

The posterior is derived in Theorem 6.7, where

b1 = RSS(j) , a1 = n, (Γ1 )−1 = XT(j) Σ−1 X(j)


(j) (j) (j)

Applying Theorem 8.1 we get

ΔDIC = D(0) − D(1)

where, for j = 0, 1,
RSS(j)
D(j) = n ln + 2 pj . (8.33)
n
Note that, in this case, DIC coincides with Akaike information criterion
(AIC).

8.4 List of Problems


1. Consider an i.i.d. sample X1 , . . . , Xn from Poi(θ) and a conjugate prior.
(a) Calculate the posterior distribution.
(b) For a symmetric loss function, derive the Bayes rule for testing

H0 : θ ≥ 1 versus H1 : θ < 1. (8.34)


n
(c) Set the prior as Gamma(2, 1). Given n = 20, i=1 xi = 15, carry out
the test.
2. Consider an i.i.d. sample X1 , . . . , Xn from Gamma(α, θ), with density

θα α−1
f (x|θ) = x exp(−θx) (8.35)
Γ(α)

where α > 2 is known. The prior is θ ∼ Gamma(α0 , β0 ).


(a) Calculate the posterior distribution.
LIST OF PROBLEMS 237
(b) Consider the one point test problem H0 : θ = θ0 versus H1 : θ = θ0 .
Calculate the Bayes factor B01 .
(c) Suppose the Bayes factor B10 = 200. Which conclusion is possible?
3. Environmental scientists studied the accumulation of toxic elements in ma-
rine mammals. The mercury concentrations (microgram/gram) in the livers
of 28 male striped dolphins (Stenella coeruleoalba) are given in the follow-
ing table, taken from Augier et al. (1993). Set m = 4 and n = 28. Two

1.70 1.72 8.80 5.90 183.00 221.00


286.00 168.00 406.00 286.00 218.00 252.00
241.00 180.00 329.00 397.00 101.00 264.00
316.00 209.00 85.40 481.00 445.00 314.00
118.00 485.00 278.00 318.00

different models are proposed,


P0 = {N(θ, σ12 )⊗m ⊗ N(μ, σ22 )⊗(n−m) : θ ∈ R}, (8.36)
with parameters μ, σ12 , σ22 known, and
P1 = {N(θ, σ12 )⊗m ⊗ N(aθ, σ22 )⊗(n−m) : θ ∈ R}, (8.37)
where the coefficient a is known and σ12 , σ22 are known, and the same as in
(8.36).
(a) Suggest a non-informative prior for each model.
(b) Derive the posterior distributions belonging to the related non-
informative prior.
(j)
(c) Give the expressions of the fitted values xi , i = 1, . . . , n and of RSS(j)
in each model j = 0, 1.
(d) Derive the expression of the Bayes factor B01 for comparing both models.
(e) Set μ = 200, a = 50 and σ12 = 10, σ22 = 10000. Compare RSS(j) j = 0, 1.
Calculate 2 ln(B10 ). Draw the conclusion.
4. Let yij be the length of intensive care of Corona patient i in hospital j,
where i = 1, . . . , 2n, j = 1, 2. Further, the following information of each
patient i is given: x1i age, x2i body mass index, x3i chronic lung disease
and x4i serious heart conditions, where x3i ∈ {0, 1}, x4i ∈ {0, 1}, and
x3i = 1, x4i = 1 mean the existence of the respective chronic disease. The
following linear model is proposed
y1i = μ1 + β1 x1i + β2 x2i + β3 x3i + β4 x4i + ε1i , i = 1, . . . , n
y2i = μ2 + β1 x1i + β2 x2i + β3 x3i + β4 x4i + ε2i , i = n + 1, . . . , 2n,
where εij ∼ N(0, σ 2 ), i.i.d. Define θ = (μ1 , μ2 , β1 , . . . , β4 ). Suppose σ 2 is
known. The parameters θi , i = 1, . . . , 6, are independent with θi ∼ N(0, 1).
238 TESTING AND MODEL COMPARISON
(a) Calculate the posterior distribution of θ.
(c) Calculate the posterior distribution of μ1 − μ2 .
(d) Propose a Bayes test for
H0 : −0.2 ≤ μ1 − μ2 ≤ 0.2 H1 : |μ1 − μ2 | > 0.2.
5. This problem on parallel lines is related to Problem 8 in Chapter 6. Let
yi = αy + βy xi + εi , i = 1, . . . , n, εi ∼ N(0, σ 2 ) i.i.d.
(8.38)
zi = αz + βz xi + ξi , i = 1, . . . , n, ξi ∼ N(0, σ 2 ) i.i.d.,
n
where εi and ξi are mutually independent and i=1 xi = 0. The unknown
parameter is θ = (αy , αz , βy , βz , σ 2 ). We are interested in testing the par-
allelism of lines
H0 : βy = βz = β, H1 : βy = βz .
Under H0 the prior of θ(0) = (αy , αz , β, σ 2 ) is NIG(a, b, 0, I3 ) and under H1
the prior of θ(1) = (αy , αz , βy , βz , σ 2 ) is NIG(a, b, 0, I4 ).
(a) Formulate the Bayes model under H0 and derive the posterior
NIG(a1 , b1^{(0)} , γ^{(0)} , Γ^{(0)} ).
(b) Formulate the Bayes model under H1 and derive the posterior
NIG(a1 , b1^{(1)} , γ^{(1)} , Γ^{(1)} ).
(c) Show that (1/n)(b1^{(0)} − b1^{(1)} ) = ((n + 1)/(2n + 1)) (β̂y − β̂z )^2 + rest(n), with
lim_{n→∞} rest(n) = 0, where β̂y and β̂z are the Bayes estimates of βy
and βz under the alternative.
(d) Calculate the Bayes factor B01 .
6. Variance test in linear regression. We assume a univariate model
y = Xβ + ε, ε ∼ N(0, σ^2 In ),
with Jeffreys prior π(θ) ∝ σ −2 . We are interested in testing
H0 : σ^2 = σ_0^2 , H1 : σ^2 ≠ σ_0^2 .
(a) Formulate the Bayes model under H0 and derive the posterior.
(b) Formulate the Bayes model under H1 and derive the posterior.
(c) Derive the expression of the Bayes factor B01 .
(d) Set σ̂^2 = RSS/(n − p). Discuss the behavior of 2 ln(B10 ) as a function of
σ_0^2 .
7. Test on correlation between two regression lines. This problem is related to
Problem 10 in Chapter 6. We assume the model (6.162), i.e.,
yi = αy + βy xi + εi , i = 1, . . . , n, εi ∼ N(0, σ_1^2 ) i.i.d.
(8.39)
zi = αz + βz xi + ξi , i = 1, . . . , n, ξi ∼ N(0, σ_2^2 ) i.i.d.,
where Cov(εi , ξi ) = σ_{12} and ∑_{i=1}^n x_i = 0. The unknown parameter is θ =
(αy , αz , βy , βz , σ_1^2 , σ_2^2 , σ_{12} ). We are interested in testing
H0 : σ_1^2 = σ_2^2 , σ_{12} = 0, H1 : else.
(a) Formulate the Bayes model under H0 with prior NIG(2, 2, 0, I4 ) and de-
rive the posterior.
(b) Formulate the Bayes model under H1 with prior NIW(2, 0, I2 , I2 ) and
derive the posterior.
(c) Set
Σ = ( σ_1^2 σ_{12} ; σ_{12} σ_2^2 ).
Derive the Bayes estimate Σ̃ of Σ under H0 and the Bayes estimate Σ̂ of
Σ under H1 . Compare both.
(d) Derive the expression of the Bayes factor B01 .
(e) Show that
(2/(n + 1)) ln B10 = ln |Σ̂| − ln |Σ̃| + oP (1).
Chapter 9

Computational Techniques

Following the Bayesian inference principle, all we need is the posterior
distribution. In general, however, we rarely find a known distribution family
which has the same kernel as π(θ)ℓ(θ|x). In this case computational methods
help.

In this chapter we present methods for

- computing integrals ∫ h(θ, x) π(θ|x) dθ,
- generating an i.i.d. sample θ1 , . . . , θN from π(θ|x), and
- generating a Markov chain with stationary distribution π(θ|x).
The first item is important for computing Bayes estimators, Bayes factors,
and predictive distributions. The other items are useful for MAP estimators,
HPD–regions, studying the shape of the posterior distribution, or to estimate
the posterior density by smoothing methods.

Nowadays we can derive the posterior for almost all combinations of π(θ)
and ℓ(θ|x). Furthermore, we can also do it in cases where we do not have an
expression of the likelihood function, but only a data generating procedure.
Likewise, a closed form expression of the prior is not required either.

This opens broad fields of applications for Bayesian methods.

Our personal advice is:

Always try to find an analytical expression first!

Even when these expressions require computational methods for special func-
tions, such as the incomplete beta function or digamma function, properties of
these functions are well studied and can help understand the problem better.
It can also be useful to approximate the posterior by analytic expressions, and
in this case we can study the properties of the posterior more generally.

Bayesian computation is an exciting field on its own. Our purpose in writing
this textbook is to explain the main principles and the new possibilities. For more

Figure 9.1: The lazy mathematician gives up and lets the computer do it.

details we refer to Albert (2005), Chen et al. (2002), Givens and Hoeting
(2005), Robert and Casella (2010), Lui (2001), Zwanzig and Mahjani (2020,
Chapter 2), and the literature therein.

9.1 Deterministic Methods


In this section we present two methods, which are not based on simulations.
We start with the general method “brute-force”, known from cryptography.

9.1.1 Brute-Force

Algorithm 9.1 Brute-force

1. Discretize Θ ≈ {θ1 , . . . , θN }.
2. Calculate π(θj ) ℓ(θj |x) for j = 1, . . . , N .
3. Calculate mN (x) = (1/N ) ∑_{j=1}^N π(θj ) ℓ(θj |x).
4. Approximate π(θj |x) ≈ (1/mN (x)) π(θj ) ℓ(θj |x).


Figure 9.2: Example 9.1.

We illustrate this method by a toy example.

Example 9.1 (Brute-force)


Assume Θ = [0, 3π], π(θ) ∝ sin(θ)/3 + 1 and X|θ ∼ N(θ, 1). We observe x = 2.
See Figure 9.2 and the following R code. The Bayes estimate is θ̂ = 1.97. 2

R Code 9.1.10. Brute-force, Example 9.1.


theta<-seq(0,3*pi,0.001); L<-length(theta) # discretize
prior<-sin(theta)/3+1 # kernel of prior
x<-2; lik<-rep(0,L)
for(i in 1:L){lik[i]<-exp(-(x-theta[i])^2/2)} # likelihood
m<-mean(prior*lik); post<-prior*lik/m # posterior
theta.hat<-mean(theta*post) # estimate

9.1.2 Laplace Approximation


As second method we present an analytic approximation for n → ∞, exploring
basic Laplace approximation,
  
√ −1/2 1
b(θ) exp(−nh(θ)) dθ = 2πσn exp −nh b + term + O(n−2 ),
n

where θ = arg min h(θ), b = b(θ), h = h(θ) and σ 2 = h (θ)−1 ; see Tierney
et al. (1989).
For a better explanation of the Laplace approximation we set b(θ) = 1 and assume
that h(θ) is smooth, has the unique minimum θ̂ = arg min h(θ), and h′(θ̂) = 0.
Applying a Taylor expansion and the integral ∫ exp(−(x − b)^2 /(2a)) dx = √(2πa) we
obtain

∫ exp(−nh(θ)) dθ ≈ ∫ exp( −n ĥ − n ĥ′(θ − θ̂) − (n/2) ĥ″(θ − θ̂)^2 ) dθ
= ∫ exp( −n ĥ − (n/2) ĥ″(θ − θ̂)^2 ) dθ
= exp(−n ĥ) ∫ exp( −(n/2) ĥ″(θ − θ̂)^2 ) dθ
= exp(−n ĥ) √(2π) σ n^{−1/2} ,

where σ^2 = (ĥ″)^{−1} .


Tierney et al. (1989) derived a second order analytic approximation for

μ(x) = ∫ g(θ)π(θ|x) dθ = ∫ g(θ)ℓ(θ|x)π(θ) dθ / ∫ ℓ(θ|x)π(θ) dθ = ∫ bN (θ) exp(−nhN (θ)) dθ / ∫ bD (θ) exp(−nhD (θ)) dθ.

They assume that g(θ) is smooth, and that n is sufficiently large such that
the MAP estimator θMAP is unique. Note that, the second assumption is not
very strong; see the results in Chapter 5. Further, they need that (1/n) ln(ℓ(θ|x))
is asymptotically independent of n.

Here we quote only one of their approximation expressions. The hats on bN ,
hN and their derivatives indicate evaluation at θ̂N = arg min hN (θ); and re-
spectively the hats on bD , hD and their derivatives indicate evaluation at
θ̂D = arg min hD (θ). Set σ̂_N^2 = (ĥ_N″)^{−1} and σ̂_D^2 = (ĥ_D″)^{−1} . Note that, in gen-
eral θ̂N and θ̂D are different, but in case of consistent posteriors they can
converge to each other for n → ∞; see Chapter 5.

μ̂(x) = ( σ̂N exp(−n ĥN ) / (σ̂D exp(−n ĥD )) ) × ( b̂N/b̂D + σ̂_D^2 (b̂D b̂N″ − b̂N b̂D″)/(2n b̂_D^2) − σ̂_D^4 ĥ_D‴ (b̂D b̂N′ − b̂N b̂D′)/(2n b̂_D^2) ) + O(n^{−2} ). (9.1)

Depending on the specification of bN , bD we obtain from (9.1) different special
cases. For positive functions g(θ) > 0 we set bN = bD and hN = hD − (1/n) ln g,
where −nhD = ln(ℓ(θ|x)) + ln(π(θ)), and obtain

μ̂(x) = ( b̂N σ̂N exp(−n ĥN ) ) / ( b̂D σ̂D exp(−n ĥD ) ) + O(n^{−2} ). (9.2)
We illustrate the approximation (9.2) for g(θ) = θ in the following example.
Example 9.2 (Binomial distribution and beta prior)
Set X|θ ∼ Bin(n, θ) and θ ∼ Beta(α0 , β0 ). Then the posterior is Beta(α, β),
with α = α0 + x and β = β0 + n − x; see Example 2.11. We know that the
expectation of Beta(α, β) is
μ = α/(α + β).
We set bN = bD = 1 and further

hD (θ) = −(1/n) ( (α − 1) ln(θ) + (β − 1) ln(1 − θ) ),
hN (θ) = −(1/n) ( α ln(θ) + (β − 1) ln(1 − θ) ).
We calculate

hD′(θ) = −(1/n) ( (α − 1)/θ − (β − 1)/(1 − θ) ), hD″(θ) = −(1/n) ( −(α − 1)/θ^2 − (β − 1)/(1 − θ)^2 ).

Setting hD (θ) = 0 we obtain the MAP estimator


α−1 β−1
θD = , 1 − θD = .
α+β−2 α+β−2
Analogously we obtain
α β−1
θN = , 1 − θN = .
α+β−1 α+β−1
Further, we get

ĥD″ = (1/n)(α + β − 2)^2 ( 1/(α − 1) + 1/(β − 1) ), ĥN″ = (1/n)(α + β − 1)^2 ( 1/α + 1/(β − 1) ),

applying (9.2), and obtain the Laplace approximation

μ̂ = ( (α + β − 2)/(α + β − 1) )^{α+β+0.5} ( α/(α − 1) )^{α+0.5} (α − 1)/(α + β − 2) + O(n^{−2} ). (9.3)

For illustration we consider θ ∼ U[0, 1]. Then we have α + β − 2 = n. Figure 9.3
shows the convergence in (9.3) for n → ∞ under different values of p = α/(α + β),
such that α = p(n + 2), β = (1 − p)(n + 2). 2
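The convergence in (9.3) is easy to check numerically. The following short R
sketch (ours, not one of the numbered code listings of this chapter) compares
(9.3) with the exact posterior mean α/(α + β) in the setting of Figure 9.3.

# Laplace approximation (9.3) versus the exact posterior mean, uniform prior
laplace.mean<-function(a,b)
{((a+b-2)/(a+b-1))^(a+b+0.5)*(a/(a-1))^(a+0.5)*(a-1)/(a+b-2)}
p<-0.5 # p = alpha/(alpha+beta)
n<-seq(8,28,2) # sample sizes, n >= 8
a<-p*(n+2); b<-(1-p)*(n+2) # alpha, beta under the uniform prior
cbind(n,exact=a/(a+b),laplace=laplace.mean(a,b))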

9.2 Independent Monte Carlo Methods


For given data x we want to calculate the integral

μ(x) = ∫ h(θ, x) π(θ|x) dθ = ∫ h(θ, x) ℓ(θ|x)π(θ) dθ / ∫ ℓ(θ|x)π(θ) dθ. (9.4)


Figure 9.3: Laplace approximation in Example 9.2. Left: The symmetric case α = β,
p = 0.5. Right: p = 0.1. Note that, we need n ≥ 8.

The striking idea of Monte Carlo methods (MC) is that, depending on the
factorization,
h(θ, x) π(θ|x) = m(θ, x) p(θ|x), (9.5)
the integral can be rewritten as an expected value

μ(x) = ∫ m(θ, x) p(θ|x) dθ.

It can be estimated by a sample from p(θ|x). In case of an i.i.d. sample
θ(1) , . . . , θ(N ) the independent Monte Carlo approximation is given by

μ̂(x) = (1/N ) ∑_{i=1}^N m(θ(i) , x). (9.6)

For

σ^2 (x) = ∫ (m(θ, x) − μ(x))^2 p(θ|x) dθ < ∞, (9.7)

the central limit theorem implies

μ̂(x) = μ(x) + (1/√N ) σ(x) OP (1).

Note that, the rate 1/√N cannot be improved for Monte Carlo methods. The
Monte Carlo approximation of the deterministic integral is a random variable,
which for large N is close to the integral value with high probability.
The factorization ansatz (9.5) is essential. We need a distribution p(θ|x) with
both a small variance and a good random number generator. The choice of the
fraction in (9.4), the application of the MC on the denominator and numerator
separately, and generating a sample from the prior π(θ) may not be good
choices, since non-informative priors can be improper or at least have a high
variance. Conversely, priors with small variances have the risk of dominating
the likelihood; see Example 3.1 on page 30.
The following algorithm is called independent MC because it is based on an
independent sample θ(1) , . . . , θ(N ) .

Algorithm 9.2 Independent MC

1. Draw θ(1) , . . . , θ(N ) from the distribution p(·|x).
2. Approximate μ(x) = ∫ m(θ, x)p(θ|x) dθ by

μ̂(x) = (1/N ) ( m(θ(1) , x) + . . . + m(θ(N ) , x) ).

We refer to Zwanzig and Mahjani (2020, Section 2.1), Albert (2005, Section
5.7), Robert and Casella (2010, Section 3.2). Here we give only an example
for illustration of two MC methods, one based on sampling from the posterior
and the other on twice independent sampling from the prior for separate
approximation of the numerator and denominator in the fraction (9.4).

Example 9.3 (Normal i.i.d. sample and inverse-gamma prior)


Recall Example 2.14 on page 19. The data consist of n = 10 observations
x1 , . . . , x10 from N(0, σ_0^2 ) given in R code 9.2.11. We set θ = σ^2 . The maximum
likelihood estimate is

θ̂MLE = (1/n) ∑_{i=1}^n x_i^2 = 2.05.

We assume θ ∼ InvGamma(α, β), with α = 4, β = 12, prior expectation 4, and
prior variance 8. The posterior is θ|x ∼ InvGamma(α1 , β1 ), with α1 = 4 + n/2 =
9 and β1 = β + (n/2) θ̂MLE = 22.25; see Example 2.14. The expectation of the
posterior is the Bayes estimate θ̂L2 = β1 /(α1 − 1) = 2.782. The posterior variance
is 1.106, essentially smaller than the prior variance. Two different methods
are considered: sampling from the posterior and sampling from the prior. We
set N = 100. The first method draws θ1^(1) , . . . , θ1^(N ) from the posterior, then
approximates

μ̂(x) = (1/N ) ∑_{j=1}^N θ1^(j) .

Note that, for given data x, μ̂(x) is a random approximation. We get μ̂(x) =
2.803. For the second method we draw θ1^(1) , . . . , θ1^(N ) from the prior. Then we
approximate the denominator integral

μDen (x) = ∫_0^∞ θ^{−(α+n/2+1)} exp( −(β + (n/2) θ̂MLE )/θ ) dθ

by

μ̂Den (x) = (1/N ) ∑_{j=1}^N (θ1^(j) )^{−n/2} exp( −(n/2) θ̂MLE /θ1^(j) ).

Further we draw θ2^(1) , . . . , θ2^(N ) from the prior. Then we approximate the nu-
merator integral

μNum (x) = ∫_0^∞ θ^{−(α+n/2)} exp( −(β + (n/2) θ̂MLE )/θ ) dθ

by

μ̂Num (x) = (1/N ) ∑_{j=1}^N (θ2^(j) )^{−n/2+1} exp( −(n/2) θ̂MLE /θ2^(j) ).

The approximation of the Bayes estimate by the second method is given by
the quotient

θ̂L2 = μNum (x)/μDen (x) ≈ μ̂Num (x)/μ̂Den (x) = 2.716.

Observe that, the true underlying parameter σ_0^2 = 2.25. In Figure 9.4 the
approximation of μ(x) for increasing simulation size is shown. Both methods
are compared in Figure 9.6. 2

R Code 9.2.11. Independent MC, Example 9.3.

x<-c(-0.06,2.43,1.55,1.76,1.27,0.94,0.48,1.16,1.32,1.81)# data
n<-length(x) # sample size
MLE<-sum(x**2)/n # maximum likelihood estimate
N<-100 # simulation size
library(invgamma)
a0<-4; b0<-12 # prior parameters
b0^2/((a0-1)^2*(a0-2)) # prior variance
a1<-a0+n/2; b1<-b0+MLE*n/2 # posterior parameters
b1^2/((a1-1)^2*(a1-2)) # posterior variance
b1/(a1-1) # Bayes estimate
# Sampling from the posterior (MC1)
MC1<-function(N,a1,b1)
{mu<-mean(rinvgamma(N,a1,b1)); return(mu)}
MC1(N,a1,b1)# MC1 approximation of the Bayes estimate
# Sampling from the prior (MC2)
MC2<-function(N,a0,b0){


Figure 9.4: Example 9.3. The MC approximations converge to the Bayes estimate
(straight line) with increasing simulation size N . But even for large N the approxi-
mations still have random deviations around the Bayes estimate. The Bayes estimate
and the maximum likelihood estimate differ from the true parameter. For them, only
a convergence for increasing sample size n is possible.

theta1<-rinvgamma(N,a0,b0) # simulate from the prior


lik1<-rep(0,length(theta1)) # likelihood function
for ( i in 1:length(theta1))
{lik1[i]<-theta1[i]**(-n/2)*exp(-n/2*MLE/theta1[i])}
m0<-mean(lik1); # MC approximation
theta2<-rinvgamma(N,a0,b0) # simulate from the prior
lik2<-rep(0,length(theta2)) # likelihood function
for ( i in 1:length(theta2))
{lik2[i]<-theta2[i]**(-n/2)*exp(-n/2*MLE/theta2[i])}
m1<-mean(theta2*lik2) # MC approximation
return(m1/m0)}
MC2(N,a0,b0)# MC2 approximation of Bayes estimate

9.2.1 Importance Sampling (IS)


IS is another approximation method for the integral

μ(x) = ∫ h(θ, x) π(θ|x) dθ.

As with independent MC, the importance sampling method explores an independent
sample and the law of large numbers.
For an efficient MC method, we search for a generating distribution p(θ|x)
in (9.5) with small variance. The importance sampling method is based on
another approach. The main idea is to sample from a trial distribution g(θ|x),
and not from the target distribution, π(θ|x), and to correct the sampling from
the wrong distribution by weights, called importance. The weights are the
likelihood quotient of the target and the trial distribution.
The trial distribution g should be easy to sample, and its support must include
the support of h(θ, x)π(θ|x). We do not need a closed form expression of g,
but we should be able to calculate the density at arbitrary points. The choice
of the trial is essential for the performance of the procedure. High importance
weights are a bad sign and imply a high variance of the approximation. The
advice is to avoid this effect by choosing a heavier tail distribution as trial
rather than the target. The trial g(θ|x) is also called importance function or
instrumental distribution.
Importance sampling is particularly recommended for calculating posterior
tail probabilities when the trial distribution is chosen such that more values
are sampled from the tail region. The importance algorithm is given as follows.

Algorithm 9.3 Importance Sampling (IS)


1. Draw θ(1) , . . . , θ(j) , . . . , θ(N ) from the trial distribution g(·|x).
2. For π(θ|x) ∝ k(θ|x), calculate the importance weights

w(j) = w(θ(j) , x) = k(θ(j) |x)/g(θ(j) |x), j = 1, . . . , N.

3. Approximate μ(x) = ∫ h(θ, x) π(θ|x) dθ by

μ̂(x) = ( w(1) h(θ(1) , x) + . . . + w(N ) h(θ(N ) , x) ) / ( w(1) + . . . + w(N ) ).

Note that, the constant of the posterior is not needed. Usually we apply
π(θ)ℓ(θ|x) ∝ k(θ|x). In contrast to independent MC, where the numerator
integral and the denominator integral are approximated separately, we gener-
ate only one sample θ(1) , . . . , θ(N ) .
The IS method is based on a weighted law of large numbers. For π(θ|x) =
c0 (x) k(θ|x), we have

(1/N ) ∑_{j=1}^N w(j) h(θ(j) , x) →P ∫ h(θ, x)w(θ, x) g(θ|x) dθ = c0 (x)^{−1} μ(x),

since

∫ ( h(θ, x)k(θ|x)/g(θ|x) ) g(θ|x) dθ = ∫ h(θ, x)k(θ|x) dθ = c0 (x)^{−1} ∫ h(θ, x)π(θ|x) dθ,

and

(1/N ) ∑_{j=1}^N w(j) →P ∫ ( k(θ|x)/g(θ|x) ) g(θ|x) dθ = c0 (x)^{−1} ∫ π(θ|x) dθ = c0 (x)^{−1} ,

such that μ̂(x) →P μ(x).
The importance weights are quotients of two densities up to a constant. For
this, we state the following theorem from Lui (2001, Theorem 2.5.1, page 31).

Theorem 9.1 Let f (z1 , z2 ) and g(z1 , z2 ) be two probability densities, where
the support of f is a subset of the support of g. Then,
   
Var_g ( f (z1 , z2 )/g(z1 , z2 ) ) ≥ Var_g ( f1 (z1 )/g1 (z1 ) ),

where f1 (z1 ) and g1 (z1 ) are the marginal densities.


Proof: Using f1 (z1 ) = ∫ f (z1 , z2 ) dz2 , g(z1 , z2 ) = g1 (z1 )g2|1 (z2 |z1 ), and
r(z1 , z2 ) = f (z1 , z2 )/g(z1 , z2 ) we obtain

r1 (z1 ) = f1 (z1 )/g1 (z1 ) = ∫ r(z1 , z2 )g2|1 (z2 |z1 ) dz2 = Eg (r(z1 , z2 )|z1 ).

The statement follows from

Var(r(z1 , z2 )) = E(Var(r(z1 , z2 )|z1 )) + Var(E(r(z1 , z2 )|z1 )) ≥ Var(r1 (z1 )).

2
Theorem 9.1 is based on the Rao–Blackwell Theorem; see Liero and Zwanzig
(2011, Theorem 4.5, page 104). Based on this result, Lui formulated a rule of
thumb for Monte Carlo computation:

“One should carry out analytical computation as much as possible.”

We refer to Zwanzig and Mahjani (2020, Section 2.1.1), Albert (2005, Section
5.9), Robert and Casella (2010, Section 3.3), Chen et al. (2002, Chapter 5)
and Lui (2001, Section 2.5).

Example 9.4 (Normal distribution and inverse-gamma prior)


We continue with the data and estimation problem given in Example 9.3.
We apply importance sampling using the trial distribution Gamma(6, 2) with
density

g(θ) = (2^6 /Γ(6)) θ^5 exp(−2θ).


Figure 9.5: Example 9.3. The continuous line, inverse-gamma distribution, is the
target, and the broken line, the gamma distribution, is the trial.

In Figure 9.5 the posterior and the trial distribution are plotted together. We
set N = 100 and generate observations from the trial, given by θ(1) , . . . , θ(N ) .
Using π(θ)ℓ(θ|x) ∝ k(θ|x), with

k(θ|x) = (1/θ)^{α+1+n/2} exp( −(β + (n/2) θ̂MLE )/θ ),

the importance weights are calculated by w(j) = k(θ(j) |x)/g(θ(j) ). We obtain
∑_{j=1}^N w(j) = 2.778851 × 10^{−6} and ∑_{j=1}^N w(j) θ(j) = 7.860973 × 10^{−6} . Thus
the Bayes estimate approximated by importance sampling is 2.83. Recall that,
the Bayes estimate is 2.78 and σ_0^2 = 2.25. See also the following R code. 2

R Code 9.2.12. Importance sampling, Example 9.4.

N<-100
a0<-4; b0<-12 # prior parameters
a<-6; b<-2 # trial parameters
IS<-function(N,a0,b0,MLE,a,b)
{
theta<-rgamma(N,a,b); k<-theta^{-(a0+6)}*exp(-(b0+5*MLE)/theta);
w<-k/dgamma(theta,a,b);
return(sum(w*theta)/sum(w))
}


Figure 9.6: Comparison of methods in Examples 9.3 and 9.4 by violinplots. The
straight line is the Bayes estimate.

MC approximations deliver random numbers. For comparison of methods, we


have to compare distributions. For continuous distributions we recommend
the R function violinplot which plots the density estimation upright and
normalized by a standard height. To carry this out, we repeat each method
several times. In Examples 9.3 and 9.4 the importance sampling delivers better
results; see the following R code and Figure 9.6.

R Code 9.2.13. Violinplot, Examples 9.3 and 9.4.

library(invgamma);library(UsingR)
M1<-rep(NA,200);M2<-rep(NA,200);M3<-rep(NA,200)
for (i in 1:200){
M1[i]<-MC1(N,a1,b1); M2[i]<-MC2(N,a0,b0)
M3[i]<-IS(N,a0,b0,MLE,a,b)} # run each method 200 times
Mcomb<-c(M1,M2,M3); comb<-c(rep(1,200),rep(2,200),rep(3,200))
violinplot(Mcomb~comb,col=grey(0.8),names=c("MC1","MC2","IS"))

9.3 Sampling from the Posterior


In this section we describe two methods for sampling from the posterior (tar-
get) with the help of a trial distribution. The methods are based on different
approaches. First, we consider the sampling importance resampling (SIR), in-
troduced by Rubin in 1987, which corrects sampling from the trial by weights.
This method is based on a weighted bootstrap procedure and delivers a sam-
ple, approximately distributed like the target. As the other method, we describe
the rejection algorithm, already introduced by von Neumann in 1951. This
method corrects sampling from the wrong distribution by a test. The updated
values are independent and are exactly distributed like the target.

9.3.1 Sampling Importance Resampling (SIR)


The method consists of two sampling steps. The first step samples from the
trial and calculates the importance weights. The second step is a weighted boot-
strap step; it resamples from the first sample using the importance weights.
The algorithm is given as follows.

Algorithm 9.4 Sampling Importance Resampling (SIR)

1. Draw ϑ(1) , . . . , ϑ(m) from the trial distribution g(·|x).
2. For π(θ|x) ∝ k(θ|x), calculate and standardize the importance weights

w(j) = k(ϑ(j) |x)/g(ϑ(j) |x), ws^(j) = w(j) / ∑_{i=1}^m w(i) , j = 1, . . . , m.

3. Resample θ(1) , . . . , θ(N ) from ϑ(1) , . . . , ϑ(m) with replacement, using
probabilities ws^(1) , . . . , ws^(m) .

Thus the target distribution is approximated by the discrete distribution de-
fined on the first sample with mass function equal to the importance weights. For
distributional convergence we require N/m → 0. We refer to Albert (2005,
Section 5.10), Robert and Casella (2010, Section 3.3.2), and Givens and Hoet-
ing (2005, Section 6.2.4). Here we continue with Examples 9.3 and 9.4.

Example 9.5 (Normal distribution and inverse-gamma prior)


The data and estimation problem are given in Example 9.3. The goal is to
sample from the posterior InvGamma(a1 , b1 ), where a1 = 9 and b1 = 22.26. We
apply the trial distribution Gamma(6, 2) with density g(θ). We set m = 1000
and generate ϑ(1) , . . . , ϑ(m) from the trial. The importance weights are cal-
culated as w(j) = π(ϑ(j) |x)/g(ϑ(j) ). Set N = 100 and resample θ(1) , . . . , θ(N ) from
ϑ(1) , . . . , ϑ(m) . The distribution Psim of θ(1) , . . . , θ(N ) is tested by a one–
sample Kolmogorov–Smirnov test for H0 : Psim = InvGamma(a1 , b1 ); we
get p-value = 0.275. Further a third sample is generated directly from
InvGamma(a1 , b1 ) and a two–sample Kolmogorov–Smirnov test is carried out
to compare this third sample with θ(1) , . . . , θ(N ) , which gives p-value = 0.699.


Figure 9.7: Example 9.5. The thick continuous line is the target, the thin line is the
trial and the broken line is the density estimation of SIR simulated sample.

Thus we conclude that the SIR method works. See also the following R code
and Figure 9.7. 2

R Code 9.3.14. SIR, Examples 9.3 and 9.4.

library(invgamma)
a0<-4; b0<-12; n<-10; MLE<-2.051
a1<-a0+n/2; b1<-b0+n/2*MLE #posterior parameters
a<-6; b<-2 # trial parameters
m<-1000; y<-rgamma(m,a,b) # step 1
w<-dinvgamma(y,a1,b1)/dgamma(y,a,b); ws<-w/sum(w) # weights
N<-100; yy<-sample(y,N,replace=TRUE,prob=ws)# step 2
ks.test(yy,"pinvgamma",a1,b1) # one sample ks test
z<-rinvgamma(N,a1,b1); ks.test(yy,z) # two sample ks.test

9.3.2 Rejection Algorithm


The rejection algorithm consists of two steps. In the first step a random value
is drawn from the trial distribution; in the second step a test is carried out,
which can reject this value.
The trial distribution g(θ|x) cannot be chosen arbitrarily. Assume that the
target distribution π(θ|x) has kernel k(θ|x); then it is required that there
exists M (x) such that

k(θ|x) ≤ M (x) g(θ|x), for all θ ∈ Θ. (9.8)

Note that, the constant of the target distribution is not needed. The condition
(9.8) implies that the trial has heavier tails than the target. We will see that
the algorithm rejects less, when the trial distribution is close to the target and
M (x) is as small as possible. The rejection algorithm generates an independent
sample θ(1) , . . . , θ(j) , . . . , θ(N ) from the target. It is given as follows.

Algorithm 9.5 Rejection algorithm

Given a current state θ(j−1) .
1. Draw θ from g(θ|x) and compute the ratio

r(θ, x) = k(θ|x) / ( M (x) g(θ|x) ).

2. Draw independently u from U[0, 1] and set

θ(j) = θ if u ≤ r(θ, x), otherwise draw a new trial.

Let us explain why the rejection algorithm works. Set I = 1 if θ is accepted
and I = 0 otherwise. An updated θ has the distribution g(θ|x, I = 1). For
π(θ|x) = c0 (x)k(θ|x), we have

P(I = 1) = ∫ r(θ, x)g(θ|x) dθ = (1/M (x)) ∫ k(θ|x) dθ
= (1/(c0 (x)M (x))) ∫ π(θ|x) dθ = 1/(c0 (x)M (x)),

so that

g(θ|x, I = 1) = P(I = 1|θ) g(θ|x) / P(I = 1) = r(θ, x) g(θ|x) c0 (x) M (x)
= ( k(θ|x)/(g(θ|x)M (x)) ) g(θ|x) c0 (x) M (x)
= k(θ|x) c0 (x) = π(θ|x).

One criticism of the rejection method is that it generates “useless” simula-
tions when rejecting. The probability of the proposal being accepted is

Figure 9.8: Rejection algorithm. The girl changes the fortune area on the wheel
depending on the chosen ball.

(c0 (x)M (x))−1 , that is why the choice of a small M (x) fulfilling (9.8) is es-
sential for an effective algorithm.
We refer to Zwanzig and Mahjani (2020, Section 1.3), Lui (2001, Section 2.2),
Robert and Casella (2010, Section 2.3), and Givens and Hoeting (2005, Section
6.2.3). Here we continue with Examples 9.3 and 9.4.

Example 9.6 (Rejection Algorithm)


The data and estimation problem are given in Example 9.3. The goal is to
sample from the posterior InvGamma(a1 , b1 ), where a1 = 9 and b1 = 22.26.
For comparison, we apply the same trial as in Example 9.5, Gamma(6, 2) with
density g(θ). Condition (9.8) is fulfilled for M = 1.6, which is relatively high,
since target and trial have different modes; see Figure 9.9. The calculations
are given in the following R code, where the acceptance probability is cal-
culated by r(θ, x) = π(θ|x)/(1.6 g(θ)). We generate 1000 observations by the rejection
algorithm. On the right side of Figure 9.9 we can see that it is a sample from
the target InvGamma(a1 , b1 ) and not from the trial. 2


Figure 9.9: Example 9.6. The trial distribution is Gamma(6, 2). The target is
InvGamma(9, 22.26). Left: The target is covered with M = 1.6, which is large since
trial and target have different modes. Right: The histogram of the random numbers
generated by the rejection algorithm and the density of the target and trial. The
generated sample belongs to the target.

R Code 9.3.15. Rejection Algorithm, Example 9.6.

library(invgamma)
n<-10; MLE<-2.051922 # sample size, sufficient statistic
a1<-a0+n/2; b1<-b0+MLE*n/2 # posterior parameters
rand.rej<-function(N,M)
{rand<-rep(0,N)
for( i in 1:N)
{
L<-TRUE
while(L){
rand[i]<-rgamma(1,6,2)
r<-dinvgamma(rand[i],a1,b1)/(dgamma(rand[i],6,2)*M)
if(runif(1)<r){L<-FALSE}
}
}
return(rand)
}
R<-rand.rej(1000,1.6)
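The acceptance probability (c0 (x)M (x))^{−1} can also be checked empirically. A
short sketch of ours, reusing a1, b1 and the trial of Example 9.6; since the
normalized posterior density is used as k(θ|x), here c0 (x) = 1 and the rate
should be close to 1/M = 0.625.

M<-1.6
th<-rgamma(100000,6,2) # proposals from the trial
r<-dinvgamma(th,a1,b1)/(dgamma(th,6,2)*M) # acceptance ratios
mean(runif(100000)<r) # empirical acceptance rate, approx. 1/M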

9.4 Markov Chain Monte Carlo (MCMC)


Generating an i.i.d sample from the posterior can be an inefficient task; es-
pecially when the parameter is high dimensional. The way out is to generate
a sequence where each member is based on the previous member, i.e., to
258 COMPUTATIONAL TECHNIQUES
generate a Markov chain with posterior π(θ|x) as stationary distribution. In
this section we explain the main principle of two types of these methods:
Metropolis–Hastings algorithm and Gibbs sampling.
The Metropolis–Hastings algorithm has the posterior as target and generates
new proposals by a trial distribution depending on previous member. The
simulation from the wrong distribution is corrected by a test.
The Gibbs sampler explores the hierarchical structure of Bayes models, where
the posterior can be factorized by conditional distributions. The new value
is generated recursively conditioned on the previous member. A test is not
needed.
The common principle of MCMC is briefly formulated as follows: instead of
the law of large numbers for i.i.d. samples we apply the ergodic properties of
Markov chains.
We consider the problem of determining

μ(x) = ∫ h(θ, x)π(θ|x) dθ. (9.9)

The general principle of Markov Chain Monte Carlo Methods can be formu-
lated as follows.

Algorithm 9.6 General MCMC

Start with any configuration θ(0) .
1. Sample a Markov chain θ(1) , . . . , θ(N ) with stationary distribution π(θ|x)
and with actual transition probability

A(ϑ, θ) = P( θ(j) = θ | θ(j−1) = ϑ ),

such that

π(θ|x) = ∫ A(ϑ, θ)π(ϑ|x) dϑ. (9.10)

2. After a burn-in period of length m approximate μ(x) by

μ̂(x) = (1/(N − m)) ∑_{j=m+1}^N h(θ(j) , x).

Let us explain why MCMC works; for more details see Lui (2001, p. 249 and
pp. 269). Suppose the Markov chain θ(1) , . . . , θ(N ) is irreducible and aperiodic.
Then for any initial distribution of θ(0) it holds that μ̂(x) →P μ(x) for N → ∞,
and furthermore

√N (μ̂(x) − μ(x)) / σh (x) → N(0, 1), where σh^2 (x) = σ^2 (x) ( 1 + 2 ∑_{j=1}^∞ ρj (x) ),

with

σ^2 (x) = Var(h(θ(1) , x)), ρj (x) σ^2 (x) = Cov(h(θ(1) , x), h(θ(j+1) , x)).

Thus

μ̂(x) = μ(x) + (1/√N ) σh (x) OP (1). (9.11)
The approximation in (9.11) is better for smaller σh (x). We search for a
Markov chain which can be easily generated by using the previous members,
but still has a sufficiently low dependence structure.
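In practice σh (x) itself has to be estimated from the generated chain. A rough
R sketch of ours, truncating the autocorrelation sum in σh^2 (x) at a finite lag
(the truncation lag is a tuning choice, not part of the definition):

sigma2.h<-function(h,lag.max=50)
{rho<-acf(h,lag.max=lag.max,plot=FALSE)$acf[-1] # rho_1,...,rho_lag.max
var(h)*(1+2*sum(rho))}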

9.4.1 Metropolis–Hastings Algorithms


Note that, in literature the name MCMC is often used for Metropolis–Hastings
algorithms only. The goal is to generate a Markov chain with stationary dis-
tribution π(θ|x). This problem is solved by the Metropolis algorithm, first
published in Metropolis et al. (1953), later generalized by Hastings (1970).
The Metropolis–Hastings algorithm is a trial and error procedure. The trial
step generates a proposal ϑ from T (θ(j−1) , ϑ) = p(ϑ|θ(j−1) ). The error step
carries out a test, to decide whether ϑ fits in the Markov chain or not. In case
it fits, the new member of the chain is the proposal, θ(j) = ϑ; otherwise the
chain gets stuck and we set θ(j) = θ(j−1) .
Set π(θ|x) = c0 (x)k(θ|x). The c0 (x) is not needed in the algorithm; the test
depends on the trial distribution and on the kernel k(θ|x) only.
The main steps of Metropolis–Hastings algorithm are given as follows.

Algorithm 9.7 Metropolis–Hastings


Given current state θ(j−1) .
1. Draw ϑ from T (θ(j−1) , ·).
2. Calculate the Metropolis–Hastings ratio

R(θ(j−1) , ϑ) = ( k(ϑ|x) T (ϑ, θ(j−1) ) ) / ( k(θ(j−1) |x) T (θ(j−1) , ϑ) ).

3. Generate u from U[0, 1]. Update

θ(j) = ϑ if u ≤ min(1, R(θ(j−1) , ϑ)), otherwise θ(j) = θ(j−1) .
Let us explain why this algorithm produces a Markov chain with A(ϑ, ϑ′),
such that (9.10) holds. Note that, the actual transition function is not equiv-
alent to the trial distribution. It is sufficient for (9.10) to show the balance
equality, i.e.,

π(ϑ|x)A(ϑ, ϑ′) = π(ϑ′|x)A(ϑ′, ϑ), (9.12)

since ∫ A(ϑ′, ϑ) dϑ = 1 and

∫ π(ϑ|x)A(ϑ, ϑ′) dϑ = ∫ π(ϑ′|x)A(ϑ′, ϑ) dϑ = π(ϑ′|x) ∫ A(ϑ′, ϑ) dϑ = π(ϑ′|x).

It remains to show (9.12). We assume that the trial function always gives a
new proposal: T (ϑ, ϑ) = 0. This implies that the chain only gets stuck when
the proposal is rejected. Let δϑ (ϑ′) be the Dirac mass, i.e., δϑ (ϑ′) = 1 for
ϑ′ = ϑ and δϑ (ϑ′) = 0 otherwise. The proposal is accepted with probability
r(ϑ, ϑ′) = min(1, R(ϑ, ϑ′)). Thus

A(ϑ, ϑ′) = P(“coming from ϑ to ϑ′”)
= P(“coming from ϑ to ϑ′”)(1 − δϑ (ϑ′)) + P(“coming from ϑ to ϑ′”)δϑ (ϑ′)
= P(“ϑ′ is proposed and accepted”)(1 − δϑ (ϑ′)) + P(“all proposals rejected”)δϑ (ϑ′).

The proposal is independent of the test, such that

P(“ϑ′ is proposed and accepted”) = P(“ϑ′ is proposed”)P(“ϑ′ is accepted”) = T (ϑ, ϑ′)r(ϑ, ϑ′).

Using ∫ T (ϑ, ϑ′) dϑ′ = 1 and setting r(ϑ) = ∫ T (ϑ, ϑ′)r(ϑ, ϑ′) dϑ′, we get

P(“all proposals rejected”) = ∫ T (ϑ, ϑ′)(1 − r(ϑ, ϑ′)) dϑ′ = 1 − r(ϑ).

Thus

A(ϑ, ϑ′) = T (ϑ, ϑ′)r(ϑ, ϑ′)(1 − δϑ (ϑ′)) + (1 − r(ϑ))δϑ (ϑ′).

Since T (ϑ, ϑ′)δϑ (ϑ′) = T (ϑ, ϑ) = 0, we have

A(ϑ, ϑ′) = T (ϑ, ϑ′)r(ϑ, ϑ′) + (1 − r(ϑ))δϑ (ϑ′) = A1 (ϑ, ϑ′) + A2 (ϑ, ϑ′).

We get the symmetry in ϑ and ϑ′, since

π(ϑ|x)A1 (ϑ, ϑ′) = π(ϑ|x)T (ϑ, ϑ′)r(ϑ, ϑ′)
= π(ϑ|x)T (ϑ, ϑ′) min( 1, (π(ϑ′|x) T (ϑ′, ϑ)) / (π(ϑ|x) T (ϑ, ϑ′)) )
= min( π(ϑ|x)T (ϑ, ϑ′), π(ϑ′|x)T (ϑ′, ϑ) )
= π(ϑ′|x)A1 (ϑ′, ϑ)

and

π(ϑ|x)A2 (ϑ, ϑ′) = π(ϑ|x)(1 − r(ϑ))δϑ (ϑ′) = π(ϑ′|x)A2 (ϑ′, ϑ).

Hence (9.12) holds.


The Markov chains have a “burn-in” time m, the waiting time until the equi-
librium is reached. To determine m, it is useful to run the algorithm with
different starting values and determine the time when the chains reach the
same area; see Figure 9.11.
The key point is to find a good proposal distribution. A good balance is needed
between low correlation between the consecutive members and an easy way of
calculating the new member from the previous one, because high correlation
can give a bad approximation; see (9.11). Also, chains that often get stuck are
not useful; see Figure 9.12.
There are two methods which can be helpful for solving this conflict: an-
nealing and thinning. An annealing procedure works as follows. We carry
out Metropolis–Hastings algorithms with several trial distributions and select
the procedure with the best properties; for instance, let the variances of the trial
increase stepwise. The thinning method takes a subsequence of the generated
chain; then the correlation between members of the new chain is smaller.
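As an illustration of thinning, one may keep, for example, every fifth member
of a chain after the burn-in and compare the autocorrelations; a short sketch
of ours, using the chain MM produced by R Code 9.4.16 below:

thinned<-MM$rand1[seq(121,1000,by=5)] # every 5th member after burn-in m=120
acf(thinned) # compare with acf(MM$rand1[121:1000])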
There exists a huge literature on MCMC; we recommend Zwanzig and Mah-
jani (2020, Section 1.3), Albert (2005, Chapter 6), Chen et al. (2002, Chapter
2), Lui (2001, Chapter 5), Robert and Casella (2010, Chapter 6), Givens and
Hoeting (2005, Section 7.1), and the references therein. The number of meth-
ods is rapidly increasing. Generally, our advice is:

Be careful in implementing an MCMC procedure!

Here we present only the algorithm, where the proposal distribution is a ran-
dom walk. The previous member is disturbed by a random variable with sym-
metric distribution around zero. The step length from one member of the
chain to the following is regulated by the variance of the disturbance; see
Figure 9.12.

Algorithm 9.8 Random-walk Metropolis


Given current state θ(j−1) .
1. Draw ε from gσ , where gσ is symmetric around zero and σ is a scaling
parameter. Set ϑ = θ(j−1) + ε.
2. Draw u from U[0, 1]. Update

θ(j) = ϑ if u ≤ min(1, k(ϑ|x)/k(θ(j−1) |x)), otherwise θ(j) = θ(j−1) .

The random-walk algorithm is a special case of the Metropolis–Hastings algo-
rithm. For instance, set ε ∼ Np (0, σ^2 Ip ), then ϑ|θ(j−1) ∼ N(θ(j−1) , σ^2 Ip ) and
θ(j−1) |ϑ ∼ N(ϑ, σ^2 Ip ). Both trials have the same density, T (ϑ, ϑ′) = T (ϑ′, ϑ),
thus the trial distribution cancels in the Metropolis–Hastings ratio. Note
that, the original Metropolis algorithm requires the symmetry of T (ϑ, ϑ′);
Hastings' contribution was the generalization to non-symmetric trials, hence the
random-walk method above is a Metropolis algorithm.
The following example is just an illustration of a random-walk Metropolis
algorithm with the uniform disturbance distribution U[−d, d] × U[−d, d] in a
simple linear regression with normally distributed errors, known error variance
and Cauchy priors for the intercept and slope.

Example 9.7 (Random-Walk metropolis)


Consider a simple linear regression yi = θ1 + xi θ2 + εi with εi ∼ N(0, 0.5) and
n = 8. The design points xi , i = 1, . . . , 8, are equidistant between 0 and 1.4.
The data points are plotted in Figure 9.10. We have θ = (θ1 , θ2 ). As Cauchy
priors we set θ1 ∼ C(0, 1.5) and θ2 ∼ C(1, 2) independently, plotted in Figure
9.10. The trial distribution is a random walk. Suppose the current state is
θ(j−1) = (θ1^(j−1) , θ2^(j−1) ). Then the proposal ϑ = (ϑ1 , ϑ2 ) is generated by

ϑ1 = θ1^(j−1) + u1 and ϑ2 = θ2^(j−1) + u2 ,

where u1 and u2 are independently U[−d, d] distributed. The tuning parameter


is d; see Figure 9.12 for different choices. The Metropolis–Hastings ratio is

R(ϑ, θ(j−1) ) = min(1, LRT(ϑ, θ(j−1) )PRT(ϑ, θ(j−1) )),

where LRT is the likelihood ratio and PRT is the prior ratio, here the quotient
of Cauchy densities,

LRT(ϑ, θ) = ℓ(ϑ|y)/ℓ(θ|y), PRT(ϑ, θ) = π(ϑ)/π(θ).
We see in Figure 9.11 that after m = 120 the equilibrium is reached. We take
d = 0.8 and N = 1000, and obtain for the Bayes estimates

θ̂1,Bayes = 0.726 and θ̂2,Bayes = 2.363.

The least squares estimates are θ̂1,lse = 0.338 and θ̂2,lse = 2.969. The true
values are θ = (1, 2). Observe that the sample size n = 8 is quite small. The
following R code gives the calculations. 2

R Code 9.4.16. Random-Walk Metropolis, Example 9.7.

MCMC<-function(d,seed1,seed2,N)
{
rand1<-rep(0,N); rand2<-rep(0,N)
LRT<-rep(1,N); PRT<-rep(1,N); R<-rep(1,N)
rand1[1]<-seed1; rand2[1]<-seed2
for ( i in 2:N)
{
rand1[i]<-rand1[i-1]+runif(1,-d,d)# proposal intercept
rand2[i]<-rand2[i-1]+runif(1,-d,d)# proposal slope, U[-d,d] disturbance
A<-prod(dnorm(Y,rand1[i]+rand2[i]*x,sqrt(0.5)))# likelihood
B<-prod(dnorm(Y,rand1[i-1]+rand2[i-1]*x,sqrt(0.5)))
LRT[i]<-A/B # new/old # likelihood ratio
p1<-dcauchy(rand2[i],1,2)*dcauchy(rand1[i],0,1.5)# prior density at proposal
p2<-dcauchy(rand2[i-1],1,2)*dcauchy(rand1[i-1],0,1.5)# prior density at current state
PRT[i]<-p1/p2 # new/old # prior ratio
R[i]<- LRT[i]*PRT[i] # Metropolis ratio
r<-min(1,R[i])
u<-runif(1)
if (u<r){rand1[i]<-rand1[i]}else{rand1[i]<-rand1[i-1]}
if (u<r){rand2[i]<-rand2[i]}else{rand2[i]<-rand2[i-1]}
}
return(data.frame(rand1,rand2,LRT,PRT,R))
}
MM<-MCMC(0.8,0,0,1000) # carry out MCMC
mean(MM$rand1[121:1000]) # Bayes estimate intercept
mean(MM$rand2[121:1000]) # Bayes estimate slope

9.4.2 Gibbs Sampling


Example 2.19 on page 25 illustrates hierarchical Bayes modelling. The authors
in Dupuis (1995) apply Gibbs sampling.
We explain the general principle. Suppose that the parameter can be written
as
θ = (θ1 , . . . , θp ) ,


Figure 9.10: Example 9.7. Left: Generated data points and the true regression line.
The broken line is the least squares fit. Right: The prior distribution for the intercept
is a Cauchy distribution with location parameter 0, and scale parameter 1.5; the prior
for the slope is a Cauchy distribution with location parameter 1 and scale parameter
2.


Figure 9.11: Example 9.7. Left: The generated Markov chains with d = 0.8 and seed
= (0, 0). Right: Different start values for studying the burn-in time.

where the θi ’s are either one- or multidimensional. Moreover, suppose that we


can simulate the corresponding conditional posterior distributions π1 , . . . , πp ;
that is we can simulate
 
θi | (x, θ[−i] ) ∼ πi ( · | x, θ[−i] ), where θ[−i] = (θ1 , . . . , θi−1 , θi+1 , . . . , θp ).

Thus Gibbs sampling reduces the generation of high-dimensional values to


low-dimensional ones. The densities π1 , . . . , πp are called full conditionals -
only they are used for simulation.


Figure 9.12: Example 9.7. Right: The generated Markov chain for the intercept
with d = 0.2 which is too small and with d = 2 which is too large. Left: The
autocorrelation functions for different values of the tuning parameter d.

First we consider the random-scan algorithm, where in every step one com-
ponent is randomly chosen and updated, while the other components remain
unchanged.

Algorithm 9.9 Random-Scan Gibbs Sampler


Given current state θ(t) = (θi^(t) )i=1,...,p .
1. Randomly select a coordinate i from 1, . . . , p according to the probability
distribution p1 , . . . , pp on 1, . . . , p.
2. Draw θi^(t+1) from πi ( · | x, θ[−i]^(t) ).
3. Set θ[−i]^(t+1) = θ[−i]^(t) .

Alternatively, we can also change the components one after the other.

Algorithm 9.10 Systematic-Scan Gibbs Sampler


Given current state θ(t) = (θi^(t) )i=1,...,p . For i = 1, . . . , p, draw θi^(t+1) from

πi ( · | x, θ1^(t+1) , . . . , θi−1^(t+1) , θi+1^(t) , . . . , θp^(t) ).

Let us explain why the Gibbs sampler produces a “right” Markov chain. Let
i = it be the index chosen at step t. Then the actual transition probability
from ϑ to ϑ′ at step t is

At (ϑ, ϑ′) = πi ( θi^(t) = ϑ′i | θ[−i]^(t−1) = ϑ[−i] ) for ϑ, ϑ′ with ϑ′[−i] = ϑ[−i] , and At (ϑ, ϑ′) = 0 otherwise.

The chain is non-stationary, but aperiodic and irreducible for π(θ|x) > 0 for
all θ. Moreover ∑_ϑ At (ϑ, ϑ′)π(ϑ|x) = π(ϑ′|x), since for Si = {ϑ : ϑ[−i] = ϑ′[−i] }

∑_ϑ At (ϑ, ϑ′)π(ϑ|x) = ∑_{ϑ∈Si} πi ( θi^(t) = ϑ′i | θ[−i]^(t−1) = ϑ′[−i] ) π(ϑ|x)
= πi ( θi^(t) = ϑ′i | θ[−i]^(t−1) = ϑ′[−i] ) ∑_{ϑ∈Si} π(ϑ|x)
= πi ( ϑ′i | ϑ′[−i] ) π[−i] (ϑ′[−i] |x) = π(ϑ′|x).
For more details see Zwanzig and Mahjani (2020, Section 2.2.5), Albert (2005,
Chapter 10), Chen et al. (2002, Section 2.1), Lui (2001, Chapter 6), Robert
and Casella (2010, Chapter 7), Givens and Hoeting (2005, Section 7.2), and
the references therein.
Here we give two examples for illustration. The first example explores the
simulated data from Example 9.7.

Example 9.8 (Systematic-Scan Gibbs sampler)


The data is given in Example 9.7 and plotted in Figure 9.10. In contrast to Ex-
ample 9.7 we suppose that the variance is unknown, such that θ = (θ1 , θ2 , θ3 ),
where the first component is the intercept, the second the slope, and the
third the variance. We assume the conjugate prior, θ ∼ NIG(a0 , b0 , β0 , Σ0 )
with a0 = 2, b0 = 2, β0 = (0, 4)^T , and Σ0 = diag(1, 2). Applying Theorem
6.5 on page 150, and using the R code (see below) we obtain the posterior,
θ|y ∼ NIG(a1 , b1 , β1 , Σ1 ), with a1 = 10, b1 = 6.995 and

β1 = (0.128, 3.246)^T , Σ1 = ( 0.259 −0.237 ; −0.237 0.382 ).

The goal is to generate θ(1) , . . . , θ(N ) with approximately θ(j) ∼
NIG(a1 , b1 , β1 , Σ1 ) by using Gibbs sampling. The complete conditionals fol-
low as

π(θ|y) = π1 ((θ1 , θ2 )|y, θ3 ) π2 (θ3 |y) = π11 (θ1 |y, θ2 , θ3 ) π12 (θ2 |y, θ3 ) π2 (θ3 |y).

It holds for the bivariate normal distribution, (θ1 , θ2 )|y, θ3 ∼ N2 (β1 , θ3 Σ1 ), that

θ1 |y, θ2 , θ3 ∼ N( β1,1 + (σ12 /σ22 )(θ2 − β2,1 ), θ3 (σ11 − σ12^2 /σ22 ) ), θ2 |y, θ3 ∼ N(β2,1 , θ3 σ22 ),
where β1 = (β1,1 , β2,1 ) and Σ1 = (σij )i=1,2;j=1,2 . Further θ3 |y ∼
InvGamma(a1 /2, b1 /2). Here π11 (θ1 |y, θ2 , θ3 ) is N(2.147−0.622θ2 , θ3 0.111) and
π12 (θ2 |y, θ3 ) is N (3.246, θ3 0.382). Then Gibbs sampling is carried out as fol-
lows:
1. Generate θ3^(j) from InvGamma(5, 3.498).
2. Generate θ2^(j) from N(3.246, θ3^(j) 0.382).
3. Generate θ1^(j) from N(2.147 − 0.622 θ2^(j) , θ3^(j) 0.111).
After 1000 runs we take the average and obtain as approximation for the
Bayes estimates α̃ = 0.129, β̃ = 3.25 and σ̃^2 = 0.839. The Bayes estimates in
this set up are α̂Bayes = 0.128, β̂Bayes = 3.247, σ̂^2_Bayes = 0.874. 2

R Code 9.4.17. Systematic-Scan Gibbs Sampler, Example 9.8.

xx2<-seq(0,1.5,0.2); n<-length(xx2) # data


Y<-c(0.727, 1.898, 1.110, 1.648, 1.420, 2.919, 3.993, 5.613)
xx1<-rep(1,n); X11<-sum(xx1*xx1); X12<-sum(xx2*xx1)
X22<-sum(xx2*xx2); XX<-matrix(c(X11,X12,X12,X22),ncol=2)
S0<-matrix(c(1,0,0,2),ncol=2); B0<-c(0,4); a0<-2; b0<-2 #prior
SS<-solve(XX+solve(S0)) # posterior
M<-lm(Y~xx2); bb<-as.numeric(coef(M))# least squares fit and LSE
B1<-SS%*%(XX%*%bb+solve(S0)%*%B0) # Bayes estimator
a1<-a0+n # posterior
r<-Y-B1[1]*xx1-xx2*B1[2] # Bayes residuum
b1<-b0+sum(r^2)+t(B1-B0)%*%solve(S0)%*%(B1-B0)# posterior
b1/(a1-2)# Bayes variance estimate
### Gibbs sampler
v11<-SS[1,1]-SS[1,2]*SS[1,2]/SS[2,2]
aa1<-B1[1]-SS[1,2]/SS[2,2]*B1[2]
bb1<-SS[1,2]/SS[2,2]
library(invgamma)
gibbs<-function(N)
{
aa<-rep(0,N)# intercept
bb<-rep(0,N)# slope
ss<-rep(1,N) # variance
for( i in 1:N)
{
ss[i]<-rinvgamma(1,a1/2,b1/2)
bb[i]<-rnorm(1,B1[2],sqrt(ss[i]*SS[2,2]))
s<-ss[i]*v11; aa[i]<-rnorm(1,aa1+bb1*bb[i],sqrt(s))
}
return(data.frame(aa,bb,ss))
}
G<-gibbs(1000) # Carry out the sampler
mean(G$aa[100:1000]) # estimate intercept
mean(G$bb[100:1000]) # estimate slope
mean(G$ss[100:1000]) # estimate variance
The next example gives an idea for data augmentation. We are interested
in generating random numbers from the beta-binomial distribution, X ∼
BetaBin(n, α, β). We apply a Bayes model and obtain the beta-binomial dis-
tribution as marginal distribution of X from (X, θ).

Example 9.9 (Beta-binomial distribution)


Consider X|θ ∼ Bin(n, θ) and θ ∼ Beta(α, β), then θ|x ∼ Beta(α+x, β+n−x);
see Example 2.11 on page 16. Also X ∼ BetaBin(n, α, β); see Example 3.6 on
page 35.
Here we generate the joint distribution of (X, θ) using the conditional distribu-
tion of θ|X and of X|θ. The marginal distribution of θ is the prior distribution
Beta(α, β), while the marginal distribution of X is a beta-binomial distribu-
tion with

P(X = k) = \binom{n}{k} B(α + k, β + n − k)/B(α, β).
Algorithm: Given the current state (x(t) , θ(t) ) .
1. Draw θ(t+1) from Beta(α + x(t) , n − x(t) + β).
2. Draw x(t+1) from Bin(n, θ(t+1) ).
Using R code 9.4.18, N = 5000 values are simulated and compared with the
true distributions for α = β = 0.5 in Figure 9.13. 2

R Code 9.4.18. Gibbs sampler of beta-binomial distribution, Example 9.9

gibbs.beta<-function(a,b,n,N)
{
theta=rep(0,N);x=rep(0,N);
theta[1]<-rbeta(1,a,b);
x[1]<-rbinom(1,n,theta[1]);
for(i in 2:N){
theta[i]<-rbeta(1,a+x[i-1],b+n-x[i-1]);
x[i]<-rbinom(1,n,theta[i])
}
return(data.frame(x,theta))
}
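The call producing Figure 9.13 is not listed; a possible call of ours with the
settings of the figure (α = β = 0.5, n = 10, N = 5000) is:

G<-gibbs.beta(0.5,0.5,10,5000) # a=b=0.5, n=10, N=5000
hist(G$x,freq=FALSE) # marginal of X: BetaBin(10,0.5,0.5)
hist(G$theta,freq=FALSE) # marginal of theta: Beta(0.5,0.5)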

9.5 Approximative Bayesian Computation (ABC)


The aim of ABC is to generate an i.i.d. sample θ(1) , . . . , θ(N ) from the posterior
distribution π(θ|x). The ABC methods are recommended when the likelihood


Figure 9.13: Example 9.9. The values are generated by the Gibbs sampler, using R
code 9.4.18. Left: BetaBin(10, 0.5, 0.5). Right: Beta(0.5, 0.5).

is intractable, i.e., there is no closed form likelihood expression or it is compu-


tationally expensive. Theoretically it is possible to apply ABC in all situations
where we are able to generate new data for a given parameter.
With ABC we have a very general tool which allows Bayesian calculations for
almost all combinations of likelihood and prior. Nevertheless we have to
- check that the used model is reasonable and
- test that the algorithm is well-implemented.
Both items are not easy to fulfill. We explain here only the main underlying
idea and give two illustrative examples.
The main idea is intuitive and convincing.
1. Generate θnew from the prior π(·).
2. Simulate new data xnew from p(·|θnew ).
3. Compare the new data xnew with the observed data x. If there is “no
difference”, accept the parameter θnew ; otherwise reject and go back to
Step 1.
The accepted θ’s are independent random variables and have approximately
the distribution π(θ|x). The crucial point is the test of “no difference”. The
acceptance rate of ABC can be very low.
Assume that S = (S1 , S2 , ..., Sp ) is a sufficient statistic in model P = {Pθ :
θ ∈ Θ}, where Pθ is the data generating distribution.

Algorithm 9.11 ABC


Assume the Bayes model {P, π}. Given the data x.
1. Generate θ from the prior π.
2. Simulate xnew from Pθ and compute the sufficient statistic S(xnew ).
3. Calculate D = d(S(xnew ), S(x)), where d is a metric.
4. Accept θ if D ≤ q.
5. Return to 1.

The problem is to determine the distance d(S(xnew ), S(x)) and the threshold
q. For small threshold q the rejection rate is high and the algorithm is slow.
On the other hand for large thresholds the generated sample follows the prior
more than the posterior; see Figure 9.14.
In the following we present an example for illustration. It is the same set up
as in Example 9.3, on page 246, where the data x are given in R code 9.2.11. This
example has the advantage that we can compare the ABC simulated sample
with the known posterior distribution.

Example 9.10 (ABC) The data x consist of n = 10 i.i.d. observations
from N(0, σ_0^2 ). The parameter of interest is θ = σ^2 . A sufficient statistic is
S(x) = (1/n) ∑_{i=1}^n x_i^2 = 2.05. The prior is InvGamma(α0 , β0 ) with α0 = 4 and
β0 = 12. The posterior distribution is InvGamma(α1 , β1 ) with α1 = 9 and
β1 = 22.25; see Example 9.3.
1. Generate θ from InvGamma(4, 12).
2. Generate an i.i.d. sample z = (z1 , . . . , z10 ) from N(0, θ).
3. Calculate S(z) and D = |S(z) − S(x)|.
4. For D < q we accept θ; otherwise we go back to Step 1.
The following R code gives the procedure. The results for N = 1000 and dif-
ferent thresholds q are presented in Figure 9.14. 2

R Code 9.5.19. ABC, Example 9.10.

x<-c(-0.06,2.43,1.55,1.76,1.27,0.94,0.48,1.16,1.32,1.81)# data
n<-length(x) # sample size
S<-sum(x*x)/n # maximum likelihood estimate, sufficient statistic
library(invgamma)
a0<-4; b0<-12 # prior parameters
a1<-a0+n/2; b1<-b0+S*n/2 # posterior parameters
## ABC ##

Figure 9.14: Example 9.10. The histogram of the random numbers generated
by the ABC algorithm. The continuous line is the target, the posterior density
InvGamma(9, 22.25); the dotted line is the trial, the prior InvGamma(4, 12). Left: For
the threshold q = 0.1 the generated sample is approximatively distributed like the
target. Right: For the large threshold q = 10 the generated sample is approxima-
tively distributed like the trial.

ABC.dist<-function(N,q,S)
{
rand<-rep(0,N)
for( i in 1:N)
{
L<-TRUE
while(L){
rand[i]<-rinvgamma(1,a0,b0) # draw theta
xx<-rnorm(n,0,sqrt(rand[i])) # new data
SS<-sum(xx*xx)/n # new sufficient statistic
D<-abs(SS-S) # metric
if(D<q){L<-FALSE}
}
}
return(rand)
}
C<-ABC.dist(1000,0.1,2.05) # carry out ABC
# Figure
hist(C,freq=FALSE,xlab="",ylim=c(0,0.6),main="",nclass=14)
xx<-seq(0.01,15,0.1); lines(xx,dinvgamma(xx,a1,b1),lty=2,lwd=2)
box(lty=1,col=1)
The following example is an illustration of the ABC method when the likeli-
hood function is not given in a closed form and a data generating algorithm
is used.
Example 9.11 (ABC)
Consider the following latent model

yi = f (xi , θ) + εi , i = 1, . . . , n

with θ = (a, b, c) and f (x, θ) = ax + bx^2 + cx^3 , and εi , i = 1, . . . , n, are indepen-
dent standard normal. The yi are unobserved and only truncated values are
given. We observe

zi = xi + 1 if yi > xi + 1,
zi = yi if xi − 1 ≤ yi ≤ xi + 1,
zi = xi − 1 if yi < xi − 1.

The zi have no standard probability distribution, but for given θ we can
generate new data points. In Step 3 of the algorithm we use the squared
distance between the observed data z = (z1 , . . . , zn ) and the proposed data
points z′ = (z′1 , . . . , z′n ), i.e.,

d(z, z′) = ∑_{i=1}^n (zi − z′i )^2 .

The priors are a ∼ N(−1, 1), b ∼ N(1, 1), and c ∼ N(0, 1), mutually indepen-
dent. The fitted Bayes curve f̃(x) is calculated from θ(1) , . . . , θ(N ) by

θ̃ = (1/N ) ∑_{j=1}^N θ(j) = (ã, b̃, c̃), f̃(x) = ãx + b̃x^2 + c̃x^3 .

The left side of Figure 9.15 shows the true curve and the non-truncated points.
On the right side, data points together with filter region are plotted. In Fig-
ure 9.16 the true and the fitted Bayes curves are compared. The algorithm with
q = 3 delivers a good fit but the choice q = 13 gives a bad fit. 2

R Code 9.5.20. ABC, Example 9.11

# data generation procedure


Data<-function(a,b,c){
xx<-seq(0,5,0.5); n<-length(xx) # design points
ff<-a*xx+b*xx^2+c*xx^3 # regression curve
ynew<-ff+rnorm(n,0,1)# new latent data
for( i in 1:n){if(ynew[i]>xx[i]+1){ynew[i]<-xx[i]+1}} # truncate from above
for( i in 1:n){if(ynew[i]<xx[i]-1){ynew[i]<-xx[i]-1}} # truncate from below
return(ynew)
}
# ABC algorithm
ABC<-function(N,q){


Figure 9.15: Example 9.11. Left: The latent model. The observations are simulated
with a = −2, b = 2 and c = 0.3. Right: The observed model. Observations inside
the gray area are unchanged; otherwise the point on the borderline is taken.

randa<-rep(0,N); randb<-rep(0,N);randc<-rep(0,N)
for( i in 1:N)
{
L<-TRUE
while(L){
a<-rnorm(1,-1,1) # priors
b<-rnorm(1,1,1); c<-rnorm(1,0,1)
D<-Data(a,b,c)
Diff<-sum((yobs-D)*(yobs-D))
if(Diff<q){randa[i]<-a;randb[i]<-b;randc[i]<-c;L<-FALSE}
}
}
rand<-data.frame(randa,randb,randc)
return(rand)
}
AA<-ABC(100,3) # carry out ABC
ma<-mean(AA$randa); mb<-mean(AA$randb) # Bayes estimates
mc<-mean(AA$randc)
There are many different ABC methods. We refer to Beaumont (2019) and
the references therein.
We mention only one more. The combination of MCMC and ABC avoids
generation from the prior; instead MCMC steps are included. Note that, this
algorithm produces a Markov chain so that the elements in the sequence are
no longer independent.

Figure 9.16: Example 9.11 with N = 100. Left: The broken line is the Bayes curve,
calculated with q = 3. Right: The ABC procedure delivers a reasonable fit for q = 3
and a bad fit for q = 13.

Algorithm 9.12 ABC–MCMC


Given current parameter θ(j−1) .
1. Generate ϑ from the trial distribution T (θ(j−1) , ·).
2. Simulate xnew from p(·|ϑ).
3. Compare xnew with x. In case of a good fit continue, otherwise go back to
Step 1.
4. Calculate the Metropolis–Hastings ratio

R(θ(j−1) , ϑ) = ( π(ϑ)T (ϑ, θ(j−1) ) ) / ( π(θ(j−1) )T (θ(j−1) , ϑ) ).

5. Generate u from U[0, 1]. Update

θ(j) = ϑ if u ≤ min(1, R(θ(j−1) , ϑ)), otherwise θ(j) = θ(j−1) .
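No R code for Algorithm 9.12 is given in this chapter. The following sketch of
ours transfers it to the setting of Example 9.10, with a random-walk trial on θ
(so the trial ratio cancels), reusing n, S, a0, b0 and the package invgamma from
R Code 9.5.19; negative proposals are simply redrawn, which is a simplification.

ABC.MCMC<-function(N,q,d,theta0)
{rand<-rep(theta0,N)
for(j in 2:N){
repeat{
prop<-rand[j-1]+runif(1,-d,d) # random-walk trial
if(prop<=0){next} # redraw negative proposals
xx<-rnorm(n,0,sqrt(prop)) # simulate new data (Step 2)
if(abs(sum(xx*xx)/n-S)<q){break} # ABC fit check (Step 3)
}
R<-dinvgamma(prop,a0,b0)/dinvgamma(rand[j-1],a0,b0) # prior ratio (Step 4)
if(runif(1)<min(1,R)){rand[j]<-prop}else{rand[j]<-rand[j-1]}
}
return(rand)}
# e.g. R.chain<-ABC.MCMC(1000,0.1,1,2)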

9.6 Variational Inference (VI)


Given a Bayes model {P, π}, the family of posterior distributions is deter-
mined by π(θ|x) ∝ π(θ)ℓ(θ|x). In case this posterior family is difficult to
handle, it can be useful to approximate it by another family. In this context,
a model D for the posterior is posed and the best approximation of π(θ|x) in
D is determined. This is the idea of variational Bayes inference.
It is especially applicable when the likelihood does not belong to an exponen-
tial family, for which we would have the same family for prior and posterior and
would only need to update the parameters; see Section 3.2, page 38.
The Kullback–Leibler divergence K(Q|P) measures how much Q deviates from
P. It is defined by

K(Q|P) = ∫ q(θ) ln( q(θ)/p(θ) ) dθ, (9.13)
where K(Q|P) ≥ 0, with equality only for Q = P. Using the Kullback–Leibler diver-
gence as measure of proximity we want K(Q|P) to be small. We call q∗ (θ|x) the
variational density iff

q∗ (θ|x) = arg min_{q∈D} K(q(θ|x)|π(θ|x)). (9.14)

The model D is named variational family. The idea is to use q ∗ (θ|x) instead of
π(θ|x) and to explore properties of D. The challenge is to find a compromise
between a simple model D, which may give a bad approximation, and a complex
model D with a high quality approximation, for which, however, it is hard to
find q∗ (θ|x).
We refer to Lee (2022) and Bishop (2006) for more details. Here we explain
only the main approach.
Set Θ = Θ1 × · · · × Θm ⊂ Rp . We focus on the mean-field variational family,
DMF , which requires that the components of θ are independent:

q(θ|x) = ∏_{j=1}^m qj (θj |x), θj ∈ Θj , (9.15)

where qj (θj |x) is called the jth variational factor.


We discuss the criterion (9.14) and give an illustrative example where we can
compute both q ∗ (θ|x) and π(θ|x).
First, we reformulate criterion (9.14). Note that, p(θ, x) = p(x)π(θ|x), such
that
  
q(θ|x)
K(q(θ|x)|π(θ|x)) = q(θ|x) ln dθ
π(θ|x)
  
q(θ|x)p(x)
= q(θ|x) ln dθ
p(θ, x)
   
q(θ|x)
= q(θ|x) ln(p(x)) dθ + q(θ|x) ln dθ
p(θ, x)
= ln(p(x)) − ELBO(q)

where
 
ELBO(q) = q(θ|x) ln(p(θ, x)) dθ − q(θ|x) ln(q(θ|x)) dθ.
Since K(q(θ|x)|π(θ|x)) ≥ 0 it holds that

ELBO(q) ≤ ln(p(x)).

This explains the notation ELBO; it stands for evidence lower bound. Mini-
mization of the Kullback–Leibler divergence means maximization of the ELBO.
Let us illustrate the concept of variational inference for a linear model with
normal errors and known error variance.

Example 9.12 (Variational inference in linear Bayes model)


We assume
y|β ∼ Nn (Xβ, Σ) and β ∼ Np (γ0 , Γ0 ).
Applying Corollary 6.1 on page 135, we have the posterior β|y ∼ Np (γ1 , Γ1 )
with
γ1 = Γ1 (X^T Σ^{−1} y + Γ0^{−1} γ0 ),
Γ1 = (Γ0^{−1} + X^T Σ^{−1} X)^{−1} .
Assume for the posterior the following mean-field variational family,

D = {Np (γ, Γ) : γ ∈ Rp , Γ = diag(λ1 , . . . , λp ); λi > 0}.

The Kullback–Leibler divergence for multivariate normal distributions is

K(Np (γ, Γ)|Np (γ1 , Γ1 ))
= (1/2) ( tr(Γ1^{−1} Γ) − p + (γ1 − γ)^T Γ1^{−1} (γ1 − γ) + ln( det(Γ1 )/det(Γ) ) ). (9.16)

For γ1 = γ we have

K(Np (γ, Γ)|Np (γ, Γ1 )) = (1/2) ( tr(Γ1^{−1} Γ) − p + ln( det(Γ1 )/det(Γ) ) ).

Furthermore, for Γ1 = Γ, tr(Γ1^{−1} Γ) = tr(Ip ) = p and det(Γ1 ) = det(Γ), such
that K(Np (γ, Γ)|Np (γ, Γ)) = 0. Thus for orthogonal design X^T X = Ip and for
Σ = In we have min_{γ,Γ} K(Np (γ, Γ)|Np (γ1 , Γ1 )) = 0. The posterior belongs to
D and q∗ (β|y) = π(β|y). In the general case we choose γ = γ1 but we still
have to solve

min_{λ1 ,...,λp ; λi >0} ( tr(Γ1^{−1} diag(λ1 , . . . , λp )) − ln( ∏_{i=1}^p λi ) ). (9.17)

Denote the diagonal elements of Γ0^{-1} + X^T Σ^{-1} X by aii. Then

  min_{λ1,...,λp; λi>0} ∑_{i=1}^{p} ( aii λi − ln(λi) ) = ∑_{i=1}^{p} min_{λi} ( aii λi − ln(λi) )
                                                        = ∑_{i=1}^{p} ( aii λi∗ − ln(λi∗) ),

where λi∗ = aii^{-1}. We get N(γ1, Γ∗) with Γ∗ = diag(λ1∗, . . . , λp∗) as variational distribution q∗(θ|x).
Consider simple linear regression with 11 equidistant design points 0, 0.1, 0.2, . . . , 1, independent standard normal errors, and prior N(0, I2). Then (see Figure 9.17)

        Γ1^{-1} = ( 12    5.5  )          Γ∗ = ( 0.083   0     )
                  ( 5.5   4.85 ),              ( 0       0.206 ).

Figure 9.17: Example 9.12. The grey contours are from the posterior N(γ1, Γ1), the black contours belong to the variational distribution N(γ1, Γ∗); the axes show the intercept and the slope. Observe that the components of the variational distribution are independent, but have different variances.
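The numbers above can be reproduced with a few lines of base R; the sketch below assumes the design and prior stated in the example.

# Simple linear regression design of Example 9.12: 11 equidistant points on [0, 1].
x <- seq(0, 1, by = 0.1)
X <- cbind(1, x)                      # n x 2 design matrix (intercept, slope)
Gamma0 <- diag(2)                     # prior covariance; Sigma = I_n
Gamma1_inv <- solve(Gamma0) + t(X) %*% X
Gamma1_inv                            # [[12, 5.5], [5.5, 4.85]]
Gamma_star <- diag(1 / diag(Gamma1_inv))
round(Gamma_star, 3)                  # diag(0.083, 0.206), the variational covariance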

The following result on partial optimization gives the theoretical background


for Algorithm 9.13, which approximates the variational distribution q ∗ (θ|x)
by stepwise conditioning.

Theorem 9.2 (Variational factor) Assume the Bayes model {P, π} with posterior π(θ|x) and mean-field variational family DMF. Set, for q ∈ DMF, q(θ|x) = qk(θk|x) q(θ[−k]) with q(θ[−k]) = ∏_{j=1, j≠k}^{m} qj(θj|x) and θ[−k] = (θ1, . . . , θk−1, θk+1, . . . , θm). Then, for

        qk∗(θk|x) = arg min_{qk} K(q(θ|x)|π(θ|x)),

it holds that

        qk∗(θk|x) ∝ exp( ∫ q(θ[−k]) ln(π(θk|θ[−k], x)) dθ[−k] ).            (9.18)

Proof: We write K(q(θ|x)|π(θ|x)) = K1 − K2. Using q(θ|x) = qk(θk|x) q(θ[−k]),

  K1 = ∫ q(θ|x) ln(q(θ|x)) dθ
     = ∫ qk(θk|x) q(θ[−k]) ln( qk(θk|x) q(θ[−k]) ) dθ
     = ∫ qk(θk|x) q(θ[−k]) ln(qk(θk|x)) dθ + ∫ qk(θk|x) q(θ[−k]) ln(q(θ[−k])) dθ
     = ∫ qk(θk|x) ln(qk(θk|x)) dθk + ∫ q(θ[−k]) ln(q(θ[−k])) dθ[−k]
     = ∫ qk(θk|x) ln(qk(θk|x)) dθk + const_K1.

Using additionally π(θ|x) = π(θk|θ[−k], x) π(θ[−k]|x) and setting

  ln(vk(θk)) = ∫ q(θ[−k]) ln(π(θk|θ[−k], x)) dθ[−k],      cv = ∫ vk(θk) dθk,

we have

  K2 = ∫ qk(θk|x) q(θ[−k]) ln(π(θ|x)) dθ
     = ∫ qk(θk|x) q(θ[−k]) ln(π(θk|θ[−k], x)) dθ + ∫ qk(θk|x) q(θ[−k]) ln(π(θ[−k]|x)) dθ
     = ∫ qk(θk|x) ln(vk(θk)) dθk + ∫ q(θ[−k]) ln(π(θ[−k]|x)) dθ[−k]
     = ∫ qk(θk|x) ln( vk(θk)/cv ) dθk + const_K2.

Summarizing,

  K(q(θ|x)|π(θ|x)) = K( qk(θk|x) | vk(θk)/cv ) + const.

Since K(Q|P) = 0 for Q = P, the minimizer is qk∗(θk|x) = vk(θk)/cv.                         □
Applying Theorem 9.2 we formulate the coordinate ascent variational infer-
ence (CAVI) algorithm.

Algorithm 9.13 CAVI
Given the current state q(θ|x)^(t) = ∏_{j=1}^{m} qj(θj|x)^(t). For k = 1, . . . , m, calculate

        qk(θk|x)^(t+1) ∝ exp( ∫ q[−k](θ|x) ln(π(θk|θ[−k], x)) dθ[−k] ),

where

        q[−k](θ|x) = ∏_{j=1}^{k−1} qj(θj|x)^(t+1)  ∏_{j=k+1}^{m} qj(θj|x)^(t).

We specify the CAVI algorithm for an i.i.d. sample x1, . . . , xn from N(μ, σ²).

Example 9.13 (CAVI algorithm)
We apply the set up of Example 7.3 on page 185, where we have θ = (μ, σ²), the prior NIG(α0, β0, μ0, σ0²), and the posterior NIG(α1, β1, μ1, σ1²), given in (7.1), with

  σ1^{-2} = σ0^{-2} + n,    μ1 = σ1² (n x̄ + σ0^{-2} μ0),
  α1 = α0 + n,    β1 = β0 + (n − 1)s² + (1/σ0²) μ0² + n x̄² − (1/σ1²) μ1².

NIG(α1, β1, μ1, σ1²) implies the factorization

  μ|σ², x ∼ N(μ1, σ² σ1²),    σ²|x ∼ InvGamma(α1/2, β1/2).

Applying Lemma 6.5 on page 149, we get π(μ|x). The distribution π(σ²|μ, x) is obtained by the following arguments. Define zi = xi − μ. Then z1, . . . , zn are i.i.d. from N(0, σ²) with prior σ² ∼ InvGamma(α0/2, β0/2). In Example 2.14, on page 19, we derived σ²|z ∼ InvGamma(α1/2, β1(μ)/2), with β1(μ) = ∑_{i=1}^{n} zi² + β0. Thus we have the factorization

  μ|x ∼ t(α1, μ1, (β1/α1) σ1²),    σ²|μ, x ∼ InvGamma(α1/2, β1(μ)/2),

where β1(μ) = ∑_{i=1}^{n} (xi − μ)² + β0. Applying Theorem 9.2, we obtain the algorithm: Given the current state q(μ|x)^(t) q(σ²|x)^(t), calculate

  q(μ|x)^(t+1) ∝ exp( ∫ q(σ²|x)^(t) ln(π(μ|σ², x)) dσ² ),
  q(σ²|x)^(t+1) ∝ exp( ∫ q(μ|x)^(t+1) ln(π(σ²|μ, x)) dμ ).

Especially, with σ²(t) = ( ∫ q(σ²|x)^(t) (σ²)^{-1} dσ² )^{-1}, we have that q(μ|x)^(t+1) is N(μ1, σ1² σ²(t)). Further, for

  β1(t) = ∫ q(μ|x)^(t) β1(μ) dμ = ∑_{i=1}^{n} (xi − μ1)² + n σ1² σ²(t) + β0,

we have that q(σ²|x)^(t) is InvGamma(α1/2, β1(t)/2). It holds because

  ∫ q(σ²|x)^(t) ln(π(μ|σ², x)) dσ² = ∫ q(σ²|x)^(t) ( −(1/(2σ² σ1²)) (μ − μ1)² ) dσ² + const

and

  ∫ q(μ|x)^(t) ln(π(σ²|μ, x)) dμ = − ∫ q(μ|x)^(t) ( β1(μ)/(2σ²) ) dμ − (α1/2 + 1) ln(σ²) + const.     □
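A hedged R sketch of this CAVI iteration is given below; the data, the prior values and the fixed number of iterations are illustrative assumptions, not taken from the book.

# CAVI iteration of Example 9.13 (assumed data and prior values).
set.seed(2)
x <- rnorm(30, mean = 1, sd = 2)
n <- length(x); xbar <- mean(x)
alpha0 <- 2; beta0 <- 2; mu0 <- 0; sigma0sq <- 1      # prior NIG(alpha0, beta0, mu0, sigma0sq)
sigma1sq <- 1 / (1 / sigma0sq + n)                    # sigma_1^{-2} = sigma_0^{-2} + n
mu1 <- sigma1sq * (n * xbar + mu0 / sigma0sq)
alpha1 <- alpha0 + n
s2t <- 1                                              # starting value for sigma^2(t)
for (t in 1:50) {
  beta1t <- sum((x - mu1)^2) + n * sigma1sq * s2t + beta0   # beta_1(t) from q(mu|x)
  s2t <- beta1t / alpha1      # sigma^2(t) = 1 / E(1/sigma^2) under InvGamma(alpha1/2, beta1t/2)
}
c(mu1, sigma1sq * s2t)        # mean and variance of the variational factor for mu
c(alpha1 / 2, beta1t / 2)     # parameters of the variational factor for sigma^2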

9.7 List of Problems


1. Consider the stroke data given in the following table. We assume that the
treatment and control samples are independent.

success size
treatment 119 11037
control 98 11034

Let p1 be the probability of success (to get a stroke) under the treatment
and p2 the probability of success in the control group. We are interested in
θ = p1/p2, especially in testing H0 : θ ≤ 1.
(a) Which distribution model is valid?
(b) Give the Jeffreys prior for p1 and p2 .
(c) Give the least favourable prior for p1 and p2 .
(d) Derive the posterior distributions for p1 and p2 in (b) and (c).
(e) Give an R code for generating a sample of size N from the posterior
distributions of θ.
(f) In Figure 9.18 the density estimators of the posterior distributions of θ
are plotted. Can we reject H0 : θ ≤ 1?
Figure 9.18: Problem 1(f). Kernel density estimates of the posterior distribution of θ under the Jeffreys prior and under the least favourable prior.

2. Here are R codes for three different methods for calculating the same integral.
Method 1
x1<-runif(10000);
sum(dbeta(x1,1,20)*dbinom(15,1000,x1))/10000
Method 2
x2<-rbeta(10000,1,20); sum(dbinom(15,1000,x2))/10000
Method 3
f1<-function(x){dbinom(15,1000,x)*dbeta(x,1,20)}
integrate(f1,0,1)
The following results are obtained:

Method 1 2 3
Result 0.01228 0.01397 0.01476

(a) Which integral is calculated?


(b) Give the name of each method.
(c) Give the steps of the algorithm of Method 2.
(d) Why are the results so different? Which result would you prefer the most,
and why?
(e) Compare the accuracies of Methods 1 and 2.
3. Consider the following R code:
rand<-function(N,M){R<-rep(NA,N);
for(i in 1:N){ L<-TRUE;
while(L){R[i]<-rcauchy(1,C);
r<-prod(dcauchy(x,R[i]))*dnorm(R[i],2,1)/(M*dcauchy(R[i],C));
if(runif(1)<r){L<-FALSE}}}; return(R)}
M<-5; C<-3
R1<-rand(10000,M); mean(R1)
(a) Which Bayesian model is considered? Determine π(θ) and p(x|θ).
(b) Which method is carried out?
(c) What is generated?
(d) Give the main steps of the algorithm.
4. Consider n independent observations (x, y) = {(x1 , y1 ), . . . , (xn , yn )} from
a logistic regression model

        P(yi = 1|θ) = exp(θ xi) / (1 + exp(θ xi)),    i = 1, . . . , n.

The prior π(θ) is a Cauchy distribution. Given the data set (x, y):

x 0.2 0.6 0.8 1 1.5 2 2.3 3 3.2 3.5


y 1 0 0 1 0 1 0 1 1 1

In order to carry out a test the user is interested in the integral

        μ = ∫_{θ < 0.2} π(θ|x, y) dθ.

(a) Apply a random-walk Metropolis–Hastings algorithm to calculate μ.


Consider the walk

ϑ = θj + a × u, u ∼ U[−1, 1].

Specify π(θ), ℓ(θ|x, y) and the trial distribution T(θj, ϑ).


(b) Write an R code for this MCMC random-walk procedure.
(c) Make a reasonable choice for a and for the burn-time.
(d) Carry out the procedure and estimate the integral.
5. We are interested in a Gibbs sampler from

        (X, Y) ∼ N2(0, Σ),    Σ = ( 1  ρ ; ρ  1 ).                          (9.19)

(a) Give the main steps of a systematic-scan Gibbs sampler.


(b) Give an R code.
(c) Is (9.19) the stationary distribution of the generated Markov chain?
6. Consider the following R code:
library(MASS); library(tmvtnorm); library(truncnorm)
data(iris); # Iris data set,
V1<-iris[1:50,1] # length of sepal of Setosa
V2<-iris[1:50,2] # width of sepal of Setosa
data<-data.frame(V1,V2);
m_1<-mean(V1);m_2<-mean(V2);
abc<-function(N,tol)
{
rand1<-rep(NA,N);rand2<-rep(NA,N);rand3<-rep(NA,N);
rand4<-rep(NA,N);rand5<-rep(NA,N);
for(i in 1:N)
{ L<-TRUE;
while(L)
{rand1[i]<-rtruncnorm(1,0,Inf,3,1);
rand2[i]<-rtruncnorm(1,0,Inf,2,1);
rand3[i]<-runif(1,0.02,2); rand4[i]<-runif(1,0.02,2);
rand5[i]<-runif(1,0,0.9);
S11<-rand3[i]^2; S12<-rand5[i]*rand3[i]*rand4[i];
S22<-rand4[i]^2;
S<-matrix(c(S11,S12,S12,S22),2,2);
a<-c(0,0); b<-c(Inf,Inf);
Z<-rtmvnorm(50,c(rand1[i],rand2[i]),S,a,b);
D<-(mean(Z[,1])-m_1)^2+(mean(Z[,2])-m_2)^2;
if(D<tol){L<-FALSE}
}
};
D<-data.frame(rand1,rand2,rand3,rand4,rand5);return(D)
}
(a) Formulate the parametric statistical model used for generating the data.
(b) Which priors are applied?
(c) What random sample is generated?
(d) Give the main steps of the algorithm.
Chapter 10

Solutions

This chapter provides solutions to the problems listed at the end of each
chapter. In some cases, only the main arguments or a sketch of the solution are given.

10.1 Solutions for Chapter 2


1. Students exam results: The posterior probabilities π(θ|x) are given as fol-
lows.
x D C B A
θ=0 2/5 1/34 1/28 0
θ=1 2/5 30/34 12/28 1/3
θ=2 0 3/34 15/28 2/3
2. Sufficient statistic: To show that π(θ|x) = π(θ|T (x)) iff T (x) is a suf-
ficient statistic. It holds that T (x) is a sufficient statistic iff the fac-
torization criterion holds; see Liero and Zwanzig (2011, page 54), i.e.,
p(x|θ) = g(T (x)|θ)h(x). We get
π(θ|x) ∝ p(x|θ)π(θ)
∝ g(T (x)|θ)h(x)π(θ)
∝ g(T (x)|θ)π(θ) ∝ π(θ|T (x)).
3. Find a prior: It should hold a.s. for all x ∈ X and all θ ∈ Θ that the prior is independent of x, π(θ) ∝ π(θ|x)/ℓ(θ|x). Here θ = n and

  ℓ(θ|x) ∝ (θ choose x) (1/2)^θ   and   π(θ|x) ∝ (θ + x + 1 choose x) (1/2)^{θ+x}.

It holds that

  π(θ|x) / ℓ(θ|x) ∝ (θ + x + 1 choose x) / (θ choose x).

This quotient is not independent of x. Set θ = 1. Especially, for x = 0 the quotient equals 1, but for x = 1 the quotient is 3. There exists no prior such that the posterior of θ = n is negative binomial.



4. Posterior for precision parameter: In Example 2.13, set n = 4 and α = β = 2. We obtain

  τ|x ∼ Gamma( 4, 2 + (1/2) ∑_{i=1}^{4} xi² ).
5. Two alternative models:
(a) Posterior in model M0 : p|(X = 15) ∼ Beta(16, 1005); see Example 2.11.
Posterior in model M1 : λ|(X = 15) ∼ Gamma(35, 2), since

  π(λ|X = k) ∝ λ^{α−1} exp(−λβ) (λ^k / k!) exp(−λ) ∝ λ^{α+k−1} exp(−λ(β + 1)).

(b) Following the law of rare events, the difference between P0 = Bin(1000, 0.05) and P1 = Poi(20) is relatively small, since we set Ep = 1/20 = 0.05 in the first model and Eλ = 20 in the second. The first two prior moments in model M0 are Ep = 1/20 = 0.05 and Var p = 0.002; the posterior moments are E(p|X = 15) = 0.016 and Var(p|X = 15) ≈ 0. The first two prior moments in model M1 are Eλ = 20 and Var λ = 20; the posterior moments are E(λ|X = 15) = 17.5 and Var(λ|X = 15) = 8.75. Keep in mind that the parameter of interest is θ = p in M0 and θ = λ/1000 in M1; the posterior expectations of θ are 0.016 and 0.017 (these numbers can be checked with the short R sketch after this problem). The posterior variances of θ are very small in both models.
(c) In both models the prior variances of θ are very small, hence there is a
risk of dominating priors. A clear recommendation is not possible.
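A hedged R check of the posterior moments quoted in (b), using the posteriors Beta(16, 1005) and Gamma(35, 2) stated in (a):

a <- 16; b <- 1005                                   # posterior Beta(16, 1005) in model M0
c(a / (a + b), a * b / ((a + b)^2 * (a + b + 1)))    # E(p|X=15) approx 0.016, Var approx 0
c(35 / 2, 35 / 2^2)                                  # Gamma(35, 2): E(lambda|X=15) = 17.5, Var = 8.75
c(a / (a + b), (35 / 2) / 1000)                      # posterior expectations of theta: 0.016 and 0.017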
6. Linear regression: We calculate the posterior

  π(β|y) ∝ π(β) ℓ(β|y)
         ∝ exp( −(1/2) ∑_{k=0}^{2} (βk − 1)² ) exp( −2 ∑_{i=1}^{n} (yi − β0 − β1 xi − β2 zi)² )
         ∝ exp( −(1/2) ∑_{k=0}^{2} (βk − 1)² − 2n [ (β0 − ȳ)² + (β1 − xy)² + (β2 − zy)² ] )

with ȳ = (1/n) ∑_{i=1}^{n} yi, xy = (1/n) ∑_{i=1}^{n} xi yi, zy = (1/n) ∑_{i=1}^{n} zi yi. Completing the squares,

  c(x − a)² + (x − b)² = (c + 1) ( x − (ac + b)/(c + 1) )² − (ac + b)²/(c + 1) + ca² + b²,

we obtain

  π(β|y) ∝ exp( −((4n + 1)/2) [ (β0 − b0)² + (β1 − b1)² + (β2 − b2)² ] )

with

  b0 = (4nȳ + 1)/(4n + 1),   b1 = (4n xy + 1)/(4n + 1),   b2 = (4n zy + 1)/(4n + 1).

Thus the posteriors are independent normal distributions:

  β0|y ∼ N(b0, 1/(4n + 1)),   β1|y ∼ N(b1, 1/(4n + 1)),   β2|y ∼ N(b2, 1/(4n + 1)).

The independence is a consequence of ∑_{i=1}^{n} xi = 0, ∑_{i=1}^{n} zi = 0 and ∑_{i=1}^{n} xi zi = 0.
7. Hierarchical model:
(a) The joint distribution of (θ, μ) is normal, thus the marginal distribution
of θ is Nn (E(θ), Cov(θ)) with E(θ) = E (E(θ|μ)) = 0. Set 1n = (1, . . . , 1)T ,
for an n × 1 vector of 1s. Then Eθ = μ1. Applying (2.5),
Cov(θ) = Eμ (Cov(θ|μ)) + Covμ (E(θ|μ)) = In + 1n 1Tn .
(b) Recall that the posterior depends on the sufficient statistics only. The
sufficient statistic is
1 1
k k
1
X=( X1j , . . . , Xnj ), X|θ ∼ Nn (θ, In )
k j=1 k j=1 k

Thus π(θ|x) = π(θ|x), where


   
1 T T −1 k
π(θ|x) ∝ exp − θ (In + 1n 1n ) θ exp − (x − θ) (x − θ) .
T
2 2
Set b1 = (A + B)−1 Bb. Using
θT Aθ + (θ − b)T B(θ − b) = (θ − b1 )T (A + B)(θ − b1 ) + const
with A = (In + 1n 1Tn )−1 , B = kIn , and b = X, and using the relation
c
(In + c1n 1Tn )−1 = (In − 1n 1Tn )
cn + 1
we get
1
A + B = (k + 1)In − 1n 1Tn
n+1
and
1 1
(A + B)−1 = (In + 1n 1Tn ).
k+1 k(n + 1) + 1
Summarizing, the posterior is Nn (b1 , Σ) with b1 = kΣx and
1 1
Σ= (In + 1n 1Tn ).
k+1 k(n + 1) + 1
1 T
(c) We have θ̄ = n 1n θ thus

1 
n k
n+1 1 n+1
θ̄|x ∼ N(m, v ), m =
2
xij , v 2 = .
k(n + 1) + 1 n k i=1 j=1 n k(n + 1) + 1
10.2 Solutions for Chapter 3
1. Customer satisfaction:
(a) Set the prior π(0) = 0.3, π(1) = 0.5, π(2) = 0.2. The posterior probabil-
ities π(θ|x) are given as follows.

x + +/− − 0
θ=2 12/17 1/9 0 20/121
θ=1 5/17 5/9 50/287 100/121
θ=0 0 3/9 237/287 1/121

(b) Highest entropy prior is π(0) = π(1) = π(2) = 1/3. The posterior is the
normalized likelihood.
(c) Set π(0) = p, π(1) = 2p, π(2) = 1 − 3p. Using (3.25),

  H(p) = −p ln(p) − 2p ln(2p) − (1 − 3p) ln(1 − 3p),   arg max_p H(p) = 1/(3 + 2^{2/3});

see also the short numerical check after this problem.

(d) Set π(0) = π(1) = p, π(2) = 1 − 2p and Eθ = p + 2(1 − 2p) = 1. Thus


p = 11/30.
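A short numerical check of the maximizer in 1(c); H below is the entropy given there, and optimize() is used as a sketch only.

H <- function(p) -p * log(p) - 2 * p * log(2 * p) - (1 - 3 * p) * log(1 - 3 * p)
optimize(H, c(1e-6, 1/3 - 1e-6), maximum = TRUE)$maximum    # approx 0.218
1 / (3 + 2^(2/3))                                           # the closed form, approx 0.218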
2. Log-normal distribution:
(a) Cauchy distribution C(m, γ) is symmetric around m. Take m = 3.
(b) Distribution function of C(3, γ) is

  F(θ) = 1/2 + (1/π) arctan( (θ − 3)/γ ),   F(10) = 0.7,   γ = 7/tan(0.2π) = 9.63.

(c) The posterior is not Cauchy distribution since the posterior moments
exist.
(d) Fisher information is 1/σ². Thus πJeff(θ) ∝ 1.
3. Pareto distribution:
(a) No. The support depends on μ.
(b) No. The support depends on μ.
(c) Yes, with natural parameter α and


n
1
T (x) = − xi , Ψ(α) = − ln(α), because f (x, α) = α exp(−αx).
i=1
x

(d) By Theorem 3.3, the conjugate prior is

π(α|μ, λ) ∝ exp(αμ + λ ln(α)).

Set μ = −b. Then α ∼ Gamma(λ + 1, b), b > 0, λ > 0.


4. Geometric distribution:
(a) It belongs to an exponential family with natural parameter η = ln(1−θ),
P(X = k|θ) = exp (k ln(1 − θ) + ln(θ)) ,
T (k) = k, Ψ(η) = − ln(1 − exp(η)).
(b) By Theorem 3.3, we get the conjugate prior for η as
πη (η|μ, λ) ∝ exp (ημ − λΨ(η)) .
Transform η = h(θ) = ln(1 − θ). Then
πθ (θ|μ, λ) =πη (h(θ)|μ, λ)|h (θ)|
1
∝ exp (ln(1 − θ)μ + λ ln(θ)) ∝ (1 − θ)μ−1 θλ .
1−θ
(c) θ ∼ Beta(λ + 1, μ)
n
(d) θ | x1 , . . . , xn ∼ Beta(λ + n + 1, μ + i=1 xi )
(e) Fisher information: ln(p(k|θ)) = k ln(1 − θ) + ln(θ) and

  (d/dθ) ln(p(k|θ)) = −k/(1 − θ) + 1/θ,   I(θ) = Var( K/(1 − θ) ) = 1/((1 − θ)θ²).

(f) πJeff(θ) ∝ θ^{-1} (1 − θ)^{-1/2}
(g) Jeffreys prior is improper.


5. Gamma distribution:
The likelihood is
n
β α α−1
(α, β|x) = x exp(−βxi )
i=1
Γ(α) i
(10.1)
β nα  −1  
n n n
= x exp(α ln(x i ) − β xi ).
Γ(α)n i=1 i i=1 i=1

(a) The sample is generated by a 2-dimensional exponential family of form


the (3.17) with natural parameter θ = (α, β) and
T (x) = (ln(x), −x), Ψ(θ) = ln(Γ(α)) − α ln(β).
(b) Applying Theorem 3.3 a conjugate prior is π(θ) ∝ exp(θT μ − λΨ(θ)).
(c) The posterior is π(θ|x) ∝ exp(θT (T (x) + μ) − (λ + 1)Ψ(θ)).
(d) Consider the conjugate family, π(θ) = π(β|α)π(α). Setting b = −μ2 > 0,
π(β|α) ∝ exp(−βb + λα ln(β)) ∝ β αλ exp(−βb),
i.e., β|α ∼ Gamma(λα + 1, b). The marginal distribution of α,
π(α) ∝ Γ(α)−λ exp(αμ1 ),
does not belong to an exponential family. It is not a “well known” dis-
tribution either.
θ
6. Binomial distribution Bin(n, θ) and odds ratio η = 1−θ :

(a) Apply π(θ) = πη (h(θ))| h(θ)


dθ | with h(θ) =
θ
1−θ . We get

1 1 1−θ 1 1
π(θ) ∝ ∝ ∝ .
h(θ) (1 − θ) 2 θ (1 − θ) 2 θ(1 − θ)

(b) The distribution in (a) is called Haldane distribution. It is improper. For


x ∈ {1, . . . , n − 1},

π(θ|x) ∝ θ−1 (1 − θ)−1 θx (1 − θ)n−x ∝ θx−1 (1 − θ)n−x−1 ∝ θa (1 − θ)b ,

with a ≥ 0, b ≥ 0 is proper.
(c) Applying π(ξ) = πθ (h(ξ))| h(ξ)
dξ | with

ξ dh(ξ) 1 b
θ = h(ξ) = , = , c= ,
c+ξ dξ (c + ξ)2 a

gives
 a−1  b−1
ξ c 1 ξ a−1
π(ξ) ∝ 2
∝ .
c+ξ c+ξ (c + ξ) (b + aξ)a+b

It is the kernel of an F-distribution with (2a, 2b) degrees of freedom.


7. Binomial distribution Bin(n, θ) and natural parameter :
θ
(a) Yes, it is. Natural parameter: ϑ = ln(η) = ln( 1−θ ) and

p(x|θ) ∝ θx (1−θ)n−x ∝ exp(x ln(θ)+(n−x) ln(1−θ)) ∝ exp(ϑx−Ψ(ϑ))

with
 
exp(ϑ) 1
θ= , Ψ(ϑ) = −n ln(1 − θ) = −n ln .
1 + exp(ϑ) 1 + exp(ϑ)

(b) Applying Theorem 3.3, for λ > 0, μ ∈ R, the conjugate prior is

π(ϑ) ∝ exp (ϑμ − λn ln(1 + exp(ϑ)))


exp(ϑ)ν exp(ϑ)a
∝ ∝
(1 + exp(ϑ))nλ (1 + exp(ϑ))a+b

with a = μ and b = nλ − μ.
(c) Applying Theorem 3.3,

exp(ϑ)ν+x exp(ϑ)a1
π(ϑ|x) ∝ ∝
(1 + exp(ϑ))n(λ+1) (1 + exp(ϑ))a1 +b1

with a1 = a + x, b1 = b + n − x.
(d) We shift the natural parameter to obtain a known family. Set
 
1 b θ 1 b
z = ln = C + ϑ, ϑ = 2z − 2C = h(z), C = ln( ).
2 a1−θ 2 a

Applying π(z) = πϑ (h(z))| h(z)


dz | with
d h(z)
dz = 2, exp(h(z)) = a
b exp(2z),

exp(z)2a exp(z)2a
π(z) ∝ ∝ .
( ab exp(2z) + 1)a+b (2a exp(2z) + 2b)a+b

It is the kernel of Fisher’s z-distribution with f1 = 2a, f2 = 2b degrees


of freedom. This distribution is defined as 12 ln(F ), where F ∼ Ff1 ,f2 .
See the solution of Problem 6 above.
8. Multinomial distribution: Jeffreys prior for θ = (p1 , p2 ) is given in (3.45).
1 ηj
πJeff (θ) ∝ , θj = = hj (η), j = 1, 2
(θ1 θ2 )1/2 1 + η1 + η2
⎛ ⎞
 
∂hi (η) 1 1 + η2 −1
J= = ⎝ ⎠.
∂ηj i=1,2,j=1,2 (1 + η1 + η2 )2 −1 1 + η1
Applying (3.27)
η1 + η 2 + η 1 η2
πJeff (η) = πJeff (h(η))|J| ∝ 1/2
. (10.2)
(η1 η2 (1 + η1 + η2 )5 )

9. Hardy–Weinberg model with natural parametrization:


(a) Yes. Set Ik (x) = 1 for x = k and zero else. Set x = 0 for aa, x = 1 for Aa,
x = 2 for AA. Set Ik (x) = 1 for x = k and zero else, I0 (x)+I1 (x)+I2 (x) =
1. For x ∈ {0, 1, 2} and T (x) = 2I0 (x) + I1 (x) = 2 − I1 (x) + 2I2 (x).

P(x|θ) =θ2I0 (x) (2θ(1 − θ))I1 (x) ((1 − θ))2I2 (x)


=θ2I0 (x)+I1 (x) (1 − θ)I1 (x)+2I2 (x) 2I1 (x)
 
θ
= exp T (x) ln( ) + 2 ln(1 − θ) 2I1 (x) .
1−θ

θ
(b) Natural parameter: η = ln( 1−θ ), sufficient statistic: T (x) = 2I0 (x) +
I1 (x), and Ψ0 (θ) = −2 ln(1 − θ). It has form (3.17).
exp(η)
(c) We have θ = h(η) = 1+exp(η) . Apply Theorem 3.3 with Ψ(η) =
Ψ0 (h(θ)) = 2 ln(1 + exp(η)). A conjugate prior is

exp(η)μ
π(η|μ, λ) ∝ exp(ημ − 2λ ln(1 + exp(η)) ∝ .
(1 + exp(η))2λ
(d) Theorem 3.3 delivers the posterior

exp(η)μ+T (x)
π(η|x) = π(η|μ + T (x), λ + 1) ∝ .
(1 + exp(η))2(λ+1)

(e) Calculate the Fisher information I(η) = Var(V (η|x)), with score function
V (η|x) = ln(P(x|η)), as

exp(η) exp(η)
V (η|x) = T (x) − 2 , Var(T (x)) = 2 .
1 + exp(η) (1 + exp(η))2

Jeffreys prior is given as


1
1 exp(η) 2
πJeff (η) ∝ |I(η)| ∝
2 .
1 + exp(η)

(f) Jeffreys prior belongs to the conjugate family with μ = λ = 12 .


10. Reference prior :
(a) See Examples 3.21 and 3.24: πJeff (θ) ∝ const and
1 1
πJeff (θ|x) = √ exp(− (x − θ)2 ).
2π 2

(b) For θ ∈ [−k, k], πJeff


k
(θ) = 1
2k , the posterior is the truncated normal
distribution,
1 1
k
πJeff (θ|x) = c(k)−1 √ exp(− (x − θ)2 ), c(k) = Φ(k − x) − Φ(−k − x).
2π 2

k
(c) The Kullback–Leibler divergence K(πJeff (.|x)|πJeff (.|x)) equals
 k
1 1
c(k)−1 √ exp(− (x − θ)2 )(− log(c(k)))dθ = − ln(c(k))
−k 2π 2

so that

lim K(π(.|x)|π k (.|x)) = − ln( lim c(k)) = − ln(1) = 0.


k→∞ k→∞

k
(d) The prior πJeff (θ) is independent of θ, thus p0 (θ) = 1.
11. Shannon entropy of N(μ, σ²): We have

  H(N(μ, σ²)) = − ∫ φμ,σ²(x) ln( 1/(√(2π) σ) ) dx + (1/(2σ²)) ∫ φμ,σ²(x) (x − μ)² dx
              = ln(√(2π) σ) + (1/(2σ²)) σ² = (1/2)(ln(2πσ²) + 1) = (1/2) ln(2πσ²e).
12. Compare two normal priors:
2
(a) A sufficient statistic is x̄ ∼ N(θ, σn ); see Example 2.12. For j = 1, 2, the
2
posterior distributions related to πj are N(μj,1 , σj,1 ) with

2 σ02 σ 2 2 λσ02 σ 2
σ1,1 = , σ2,1 = .
nσ02 + σ 2 nλσ02 + σ 2

(b) Use (3.49) and the result of Problem 11 above. Note that the Shannon
entropy of N(μ, σ 2 ) is independent of the mean.
  I(P^n, π1) = (1/2) ln(2πσ0²e) − (1/2) ln(2πσ1,1²e) = (1/2) ln(nσ0² + σ²) − (1/2) ln(σ²)

and

  I(P^n, π2) = (1/2) ln(nλσ0² + σ²) − (1/2) ln(σ²).

(c) We have

  I(P^n, π1) − I(P^n, π2) = (1/2) ln( (nσ0² + σ²)/(nλσ0² + σ²) ),

  lim_{n→∞} (nσ0² + σ²)/(nλσ0² + σ²) = lim_{n→∞} (σ0² + σ²/n)/(λσ0² + σ²/n) = λ^{-1},

  lim_{n→∞} ( I(P^n, π1) − I(P^n, π2) ) = −(1/2) ln(λ).
(d) For λ > 1 and every μ0 the second prior is less informative.

10.3 Solutions for Chapter 4


1. Query:
(a) From

π(0) p(x|0)
π(θ = 0|x) = = 1 − π(θ = 1|x),
π(0) p(x|0) + π(1)p(x|1)

it follows that

π(θ = 0|x = −1) = 1, π(θ = 0|x = 0) ∝ 0.3 p, π(θ = 0|x = 1) ∝ 0.2 p.

(b) Applying Theorem 4.5 we have three different classes of priors.

x −1 0 1
0≤p≤ 2
7 δ1 (x) 0 0 0
2
7 <p≤ 3
8 δ2 (x) 0 0 1
3
8 <p≤1 δ3 (x) 0 1 1
(c) The frequentist risk is calculated by R(0, δ ) = P(C1 |0), R(1, δ ) =
π π

1 − P(C1 |1), where C1 = {x : δ π = 1}. It is given as

θ 0 1
0≤p≤ 2
7 0 1
2
7 <p≤ 3
8 0.2 0.5
3
8 <p≤1 0.5 0

(d) The Bayes risk r(π) = (1 − p)R(0, δ π ) + p R(1, δ π ), given as function of


p,
case risk behavior sup
0≤p≤ 2
7 p linear increasing 0.2857
2
7 <p≤ 3
8 0.2 + 0.3 p linear increasing 0.3125
3
8 <p≤1 0.5 − 0.5 p linear decreasing 0.3125

(e) It holds that maxp r(π) = maxp (0.2 + 0.3 p) = 0.3125. The least favourable prior is π0(1) = 3/8.

2. Binomial model :
(a) Applying Theorem 4.3 the Bayes estimator is the k1k+k 2
2
fractile of the
posterior Beta(α + x, β + n − x).
(b) See Example 4.9. For k1 = k2 the Bayes estimator is the median of
Beta(α + x, β + n − x), approximated by (4.19).
(c) The asymmetric approach is of interest, when the underestimation of the
success probability is more dangerous than overestimation, i.e., when the
success means the risk of a traffic accident.
(d) Set n = 1 and k1 = k2 . The frequentist risk is

R(θ, δ π ) = (1 − θ)L(θ, δ π (0)) + θL(θ, δ π (1)).

Exploring the inequalities for the median of Beta(α, β),


α−1 α
≤ median ≤ , for 1 < α < β
α+β−2 α+β
α−1 α
≥ median ≥ , for 1 < β < α
α+β−2 α+β
we get δ π (0) < δ π (1). Calculating R(θ, δ π ) we have three different cases:

θ < δ π (0) < δ π (1), δ π (0) < θ < δ π (1), δ π (0) < δ π (1) < θ
In each case it is impossible to find a beta prior such that the frequentist
risk is constant. We consider the first case only.

R(θ, δ π ) = δ π (0) + θ(1 + δ π (1) − δ π (0)).

But 1 + δ π (1) − δ π (0) > 0 for all beta priors with α > 1, β > 1. Thus
inside {Beta(α, β) : α > 1, β > 1} we cannot find a prior, which fulfills
the sufficient condition for a minimax estimator given in Theorem 4.13.

3. Gamma distribution θ = β, Jeffreys prior :


d
(a) The loglikelihood ln (θ|x) and the score function V (θ|x) = dθ ln (θ|x)
are
α
ln (θ|x) = − ln(Γ(α)) + α ln(θ) + (α − 1) ln(x) − θx, V (θ|x) = − x.
θ
α
Using Var(X|θ) = θ2 we obtain the Fisher information and Jeffreys prior
α 1 1
I(θ) = Var(V (θ|x)) = 2
, πJeff (θ) ∝ I(θ) 2 ∝ .
θ θ
See also Example 3.13. Using (2.9), the posterior distribution is
Gamma(α, x).
(b) Applying Theorem 4.2, δ π (x) = E(θ|x) = αx .
(c) Frequentist risk: For δ π (x) = E(θ|x) and quadratic loss,
2
R(θ, δ π ) = Var(δ π (X)|θ) + (θ − E(δ π (X)|θ)) .

Note that, if X ∼ Gamma(α, β) then X −1 ∼ InvGamma(α, β). We apply


the moments of inverse-gamma distribution; see Section 11.2.
αθ α2 θ 2
E(δ π (X)|θ) = , Var(δ π (X)|θ) = ,
α−1 (α − 1)2 (α − 2)

α2 1
R(θ, δ π ) = θ2 d(α), d(α) = + .
(α − 1)2 (α − 2) (α − 1)2
Posterior expected risk:
α
ρ(π, δ π (x)|x) = E((θ − δ π (x))2 |x) = Var(θ|x) =
x2
Bayes risk:
 ∞  ∞
1
r(π) = R(θ, δ π ) dθ = d(α) θ dθ = ∞.
0 θ 0

(d) We can neither apply Theorem 4.9 nor Theorem 4.10. Since δ π is unique,
by Theorem 4.11, δ π is admissible.
1
4. Gamma distribution θ = β, conjugate prior :
(a) See Example 3.13. Natural parameter η = −β = − θ1 and T (x) = x.
(b) See Liero and Zwanzig (2011, Theorem 4.3). Since E(T (X)) = α β = θα,
the efficient estimator is δ ∗ (x) = αx .
(c) In Example 3.13 the conjugate prior for β is Gamma(α0 , β0 ) and the
posterior is Gamma(α0 + α, β0 + x). Thus the conjugate prior of θ = β1
is the inverse-gamma distribution InvGamma(α0 , β0 ) and the posterior is
InvGamma(α0 + α, β0 + x).
(d) Applying Theorem 4.2 and the expectation of InvGamma(α0 + α, β0 + x),
δ π (x) = α0β+α−1
0 +x
.
(e) Frequentist risk of δ π and δ ∗ :
2
R(θ, δ π ) = Var(δ π (X)|θ) + (θ − E(δ π (X)|θ))
1  2 
= θ (α + α0 ) − 2θα0 (β0 + 1) + (β02 + 1)2 ,
(α0 + α − 1) 2

1 α θ2
R(θ, δ ∗ ) = Var(δ ∗ (X)|θ) = = .
α2 β 2 α
Posterior expected risk:

(β0 + x)2
ρ(π, δ π (x)|x) = E((θ − δ π (x))2 |x) = Var(θ|x) = .
(α − 1)2 (α − 2)

Bayes risk: The frequentist risk is quadratic in θ. For α0 > 2 the second
prior moment exists, such that r(π) < ∞. Particularly
 
1 −2 2 α0 − 2
r(π) = β + β 0 + 1 .
(α0 + α + 1)2 α0 − 2 0 α0 − 1

(f) Applying Theorem 4.9, 4.10 or 4.11 we note that under α0 > 2 the Bayes
estimate is admissible.
(g) Due to admissibility of the Bayes estimator, there exists a θ with
R(θ , δ ∗ ) > R(θ , δ π ), which is no contradiction to the efficiency of δ ∗
since the Bayes estimator is biased.

5. Multinomial distribution, minimax estimator :


(a) See Example 3.8. Set x = (n1 , n2 , n3 ). The posterior is θ|x ∼ Dir(α1 +
n1 , α2 + n2 , α3 + n3 ).
(b) Set α0 = α1 + α2 + α3 and α = (α1 , α2 , α3 ). Then, by Theorem 4.12,
1
δ π (x) = (α + x) ∈ Θ.
α0 + n
π
(c) Frequentist risk of δ : Using θ1 +θ2 +θ3 = 1 and the mean and covariance
of X|θ ∼ Mult(n, θ1 , θ2 , θ3 ), with
 
E(X|θ) = n θ, Cov(X|θ) = n diag(θ) − θθT
we obtain bias = E(δ π (X)|θ) − θ = 1
α0 +n (α − α0 θ) and

R(θ, δ π ) = tr Cov(δ π (X)|θ) + biasT bias


1 
3 
3 
3
= (α02 − n) θi2 − 2α0 α i θi + αi2 + n .
(α0 + n)2 i=1 i=1 i=1

(d) Bayes risk, r(π) = Eπ R(θ, δ π ), is finite since the moments of


Dir(α1 , α2 , α3 ) are finite.
(e) Least favourable prior: Applying Theorems 4.7 and 4.12 we search for
α1 , α2 , α3 such that R(θ, δ π ) = const. As R(θ, δ π ) is permutation invari-

ant with respect to α1 , α2 , α3 , we set α1 = α2 = α3 . For αi = 13 n,
n
R(θ, δ π ) = 23 (√n+n) 2.

(f) Applying Theorem 4.7 the minimax estimate is


1√
n + xi
θminimax,i = δ (x)i = 3√
π0
.
n+n

10.4 Solutions for Chapter 5


1. Consistent estimation from an i.i.d. sample of Exp (θ):
n
(a) The posterior is Gamma(α + n, β + nx̄), with x̄ = n1 i=1 xi , since

n
π(θ|x(n) ) ∝ θα−1 exp(−θβ) (θ exp(−θxi )) ∝ θα+n−1 exp(−θ(β+nx̄)).
i=1
α+n
(b) The Bayes estimator is δ(x(n) ) = E(θ|x(n) ) = β+nx̄ .
(c) Applying the law of large numbers, x̄ → EX, a.s. with EX = θ1 , so that
α
+1
δ(x(n) ) = n
β
→ θ a.s.
n + x̄
2. Kullback–Leibler divergence in exponential families:
(a) Apply (3.11) and (3.28),

  K(Pθ0|Pθ) = ∫ p(x|θ0) ln( p(x|θ0)/p(x|θ) ) dx
            = ∫ p(x|θ0) ln( [ C(θ0) exp( ∑_{j=1}^{k} ζj(θ0) Tj(x) ) ] / [ C(θ) exp( ∑_{j=1}^{k} ζj(θ) Tj(x) ) ] ) dx
            = ln( C(θ0)/C(θ) ) + ∑_{j=1}^{k} Eθ0 Tj(X) ( ζj(θ0) − ζj(θ) ).
(b) Gamma(α, β) belongs to a two-parameter exponential family with natural parameters η = (α, β), sufficient statistics T(x) = (ln(x), −x), and C(α, β) = β^α / Γ(α). Applying EX = α/β and E ln(X) = Ψ(α) − ln(β), where Ψ(α) is the digamma function, we get for K(Gamma(α0, β0)|Gamma(α, β))

  ln( Γ(α)/Γ(α0) ) + α ln( β0/β ) + Ψ(α0)(α0 − α) − (β0 − β) (α0/β0).          (10.3)
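Formula (10.3) can be checked against numerical integration; the parameter values in the following R sketch are arbitrary illustrative choices.

a0 <- 3; b0 <- 2; a <- 4; b <- 1.5
kl_formula <- lgamma(a) - lgamma(a0) + a * log(b0 / b) + digamma(a0) * (a0 - a) - (b0 - b) * a0 / b0
kl_integral <- integrate(function(x)
  dgamma(x, a0, b0) * (dgamma(x, a0, b0, log = TRUE) - dgamma(x, a, b, log = TRUE)), 0, Inf)$value
c(kl_formula, kl_integral)     # the two values agree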

3. Kullback–Leibler divergence for uniform distributions: It holds that


 θ0  
1 θI(0,θ0 ) (x)
K(U(0, θ0 )|U(0, θ)) = ln dx.
0 θ0 θ0 I(0,θ) (x)

(a) For θ < θ0 ,


 θ    θ0  
1 θ 1 1
K(U(0, θ0 )|U(0, θ)) = ln dx + ln dx = ∞.
0 θ0 θ0 θ θ0 0
For θ > θ0 ,
 θ0    
1 θ θ
K(U(0, θ0 )|U(0, θ)) = ln dx = ln .
0 θ0 θ0 θ0

(b) Further K(P|P) = 0. Applying the results above, the Kullback–Leibler


divergence is only one-sided continuous.
4. Comparison of posteriors related to different priors
The posteriors are Pi = Pπni (.|x(n) ) = Gamma(nν + αi , nx̄ + β), i = 1, 2.
Applying (10.3),
 
Γ(α2 )
K(Gamma(α1 , β)|Gamma(α2 , β)) = ln + Ψ(α1 )(α1 − α2 ).
Γ(α1 )

Using the asymptotic relations for Γ(.) and digamma function Ψ(.)

Γ(n + z) Ψ(z)
lim = 1, lim 1 = 1,
n→∞ Γ(n) z n z→∞ ln(z) − 2z

we have
   
nν + α1 1
K(P1 |P2 ) = ln − (α1 − α2 ) + o(1) = o(1).
nν 2(nν + α1 )

5. Conditions of Schwartz’ Theorem: First condition (5.14): The prior density


of N(μ, σ02 ) is positive for all θ ∈ Θ = R. Furthermore from Example 4.12,
on page 96, we know that

Kε (θ0 ) = {θ : K(Pθ0 |Pθ ) < ε} = {θ : (θ0 − θ)2 < 2 ).


Second condition (5.15): Let a, b, O = (a, b) and d > 0 be such that a <
θ0 < b, and [θ0 − d, θ0 + d] ⊂ (a, b). Using pn (x(n) |θ)π(θ) = π(θ|x(n) )p(x(n) )
and the posterior N(μ1 , σ12 ) given in Example 2.12, on page 16, we have for
the density of Pn,Oc

π(θ) pn (x(n) )
q(x(n) ) = pn (x(n) |θ) π c
dθ = π c (Φ(a1 ) + 1 − Φ(b1 )) ,
Oc P (O ) P (O )

where a1 = a−μ and b1 = b−μσ1 . For sufficiently large n, x̄ ∈ [θ0 −√d, θ0 +


1 1
σ1
d] ⊂ (a, b), there exist positive constants c1 , c2 such that a1 < −c1 n and
√ 2
b1 > c2 n . Using x̄ − θ0 ∼ N(0, σn ) and

1 t2 1 t2
Φ(−t) ≤ exp(− ), 1 − Φ(t) ≤ exp(− ), for t > 0,
2 2 2 2
there exists a positive constant c0 such that Φ(a1 )+1−Φ(b1 ) ≤ exp(−c0 n).
Hence for Pπ (Oc ) > C −1 , we have q(x(n) ) ≤ p(x(n) )C exp(−c0 n).
Set H = H 12 (P⊗nθ0 , Pn,O ) and p0 (x(n) ) = pn (x(n) |θ0 ) and split the integra-
c

tion region in Bd = {x(n) : |x̄ − θ0 | < d} and Bdc . Then


 -  -
H= p0 (x(n) )q(x(n) ) dx(n) + p0 (x(n) )q(x(n) ) dx(n) = H1 + H2 ,
Bd Bdc

where  -
H1 = p0 (x(n) )p(x(n) ) dx(n) < C exp(−c0 n).
Bd
2
Applying Cauchy–Schwarz inequality and Pθ0 (Bdc ) ≤ exp(− nd
2σ 2 ),

 1
2  1
2
1
H2 ≤ p0 (x(n) ) dx(n) q(x(n) ) dx(n) ≤ Pθ0 (Bdc ) 2
Bdc Bdc

2
d
For D0 > C + 1 and q0 = exp(− min( 4σ 2 , c0 )), (5.15) follows.

6. Counterexample to Theorem 5.3:


(a) Bayes estimate: Applying Theorem 6.2 with X = 1T2 = (1, 1) the pos-
terior is N (γ1 , Γ1 ) with Γ1 = I2 − λn 11T , γ1 = λn 12 x̄, and λn = 1 1+2 .
n
The Bayes estimator equals γ1 = (γ1,1 , γ1,2 ). We have x̄ → a + b, a.s.
and λn → 12 . For a > 0, b > 0 and a = b it holds that

1 1
lim γ1,1 = (a + b) = a, lim γ1,2 = (a + b) = b.
n→∞ 2 n→∞ 2
(b) Inconsistency of the posterior: Take θ0 = (a0 , b0 ) with a0 > 0, b0 > 0 and
a0 = b0 . Set O = {θ : ||θ − θ0 || < c} and A = {θ : |a − a0 | < c, b ∈ R}.
Then O ⊂ A and

Pπ (O|x(n) ) < Pπ (A|x(n) ) = Φ(Un ) − Φ(Ln ) → Φ(c1 ) − Φ(c2 ) = c0 < 1

where
−c − γ11 + a0 1 1
Ln =  → √ (−c + (a0 − b0 )) = c1 a.s.
(1 − λn ) 2 2

c − γ11 + a0 1 1
Un =  → √ (c + (b0 − a0 )) = c2 a.s..
(1 − λn ) 2 2

(c) Check condition (5.15): Let O be the open set defined above. Applying
pn (x(n) |θ)π(θ) = π(θ|x(n) )p(x(n) ), and setting C = 2P1−c
π (O c ) ,
0

pn (x(n) ) π c
q(x(n) ) = P (O |x(n) ) < Cpn (x(n) ).
Pπ (Oc )

Thus
H 12 (P⊗n ⊗n
θ0 , Pn,O ) > C H 12 (Pθ0 , P
c
x(n)
).

P⊗n
θ0 is Nn (μ1 , Σ1 ) and P
x(n)
is Nn (μ2 , Σ2 ) (see Theorem 6.2 on page
134), with μ1 = 1n 12 θ0 , Σ1 = In , and μ2 = 1n 1T2 γ1 , Σ2 = In + 21n 1Tn .
T

The Hellinger transform Hn = H 12 (Nn (μ1 , Σ1 ), Nn (μ2 , Σ2 )) of two mul-


tivariate normal distributions is given as
  14  
det(Σ1 ) det(Σ2 ) 1
Hn = exp − (μ1 − μ2 )T Σ−1 (μ1 − μ2 )
det(Σ)2 8

where Σ = 12 (Σ1 + Σ2 ). Applying

1
det(In + aaT ) = 1 + aT a, (In + aaT )−1 = In − aaT ,
1 + aT a
we obtain
  14  
1 + 2n 1 n
Hn = exp − 2 (a0 + b0 )2 .
(1 + n)2 8 σ (1 + n)

Hn converges to zero, but not with the required rate. It holds for all
q0 < 1, that limn→∞ q0−n Hn = limn→∞ q0−n n− 4 const = ∞.
1
10.5 Solutions for Chapter 6
1. Specify (6.23) and (6.24):
(a) Matrix form is given by y = Xβ + with
⎛ ⎞ ⎛ ⎞ ⎛ ⎞
y1 1 x 1 z1 ⎛ ⎞ ε1
⎜ ⎟ ⎜ ⎟ β ⎜ ⎟
⎜ ⎟ ⎜ ⎟ ⎜ 1⎟ ⎜ ⎟
⎜ y2 ⎟ ⎜1 x 2 z 2 ⎟ ⎜ ⎟ ⎜ ε2 ⎟
y=⎜ ⎟ ⎜
⎜ .. ⎟ , X = ⎜ .. ..

.. ⎟ , β = ⎜β2 ⎟ , =⎜ ⎟
⎜ .. ⎟ .
⎜.⎟ ⎜. . .⎟ ⎝ ⎠ ⎜.⎟
⎝ ⎠ ⎝ ⎠ β3 ⎝ ⎠
yn 1 x n zn εn

(b) We have XT ΣX = nI3 . Applying


⎛ ⎞−1 ⎛ ⎞
a b 0 c −b 0
⎜ ⎟ ⎜ ⎟
⎜ ⎟ 1 ⎜ ⎟
⎜ b c 0⎟ = ⎜ −b a 0 ⎟
⎝ ⎠ (ac − b2 ) ⎝ ⎠
2
0 0 d 0 0 ac−b d
n n
with nxy = i=1 xi yi , n zy = i=1 zi yi we get
⎛ ⎞ ⎛ ⎞
4
− 2
0 4
+ n 2
0
⎜ 3 3 ⎟ 1⎜
3 3 ⎟
⎜ ⎟ ⎜ 2 ⎟
Γ−1 = ⎜− 23 4
0 ⎟ , Γ 1 = ⎜ 4
+ n 0 ⎟,
⎝ 3 ⎠ d⎝ 3 3 ⎠
d
0 0 1 0 0 1+n
⎛ ⎞
ny(2 + 3n 2 ) + nxy + n + 2⎟
2 ⎜⎜

γ1 = ⎜ny + (2 + 3n 2 )nxy + n + 2⎠

3d ⎝
3
2 d(n + 1)−1(nzy + 1)

where d = ( 43 + n)2 − 49 .
2. Simple linear regression with Jeffreys prior :
(a) By Corollary 6.4, πJeff (θ) ∝ σ −2 .
n
(b) Posterior
n distributions: Set sxx = i=1 (xi − x̄)2= nx2 − n(x̄)2 , sxy =
n n
(x − x̄)(y − ȳ) = nxy − nx̄ȳ with nx̄ = x , nȳ = i=1 yi
i=1 i
2
 n
i
2
 n
i=1 i
and nx = i=1 xi , nxy = i=1 xi yi . It holds that
⎛ ⎞T
1 . . . 1
XT = ⎝ ⎠ ,
x 1 . . . xn
⎛ ⎞ ⎛ ⎞
n nx̄ 1 ⎝ x2 −x̄
XT X = ⎝ ⎠, (XT X)−1 = ⎠.
nx̄ nx2 sxx −x̄ 1
sxy
The least-squares estimates are α = ȳ − x̄β, β = sxx .
i. Applying Theorem 6.7 and (6.40) we obtain
⎛ ⎞ ⎛⎛ ⎞ ⎞
α α
⎝ ⎠ |y, σ 2 ∼ N2 ⎝⎝ ⎠ , σ 2 (XT X)−1 ⎠ . (10.4)
β β
2
ii. The marginal distribution of β in (10.4) is β|y, σ 2 ∼ N(β, sσxx ).
iii. Applying Lemma 6.1 on page 132 to (10.4) with
x̄ 1 1
Σ21 Σ−1
11 = − , Σ22 − Σ221 Σ−1
11 = ¯ − αx̄)
, μ2|1 = ¯2 (xy
x2 nx2 x
gives  
xy σ2x̄
β|(α, σ , y) ∼ N
2
−α , .
x2 x2 sxx
Note that, the mean in the posterior corresponds to the least-squares
estimator ofβ when the intercept α is known.
n
iv. Set RSS = i=1 (yi − yi )2 , with yi = α + xi β. Applying Theorem 6.7
and (6.40),  
n − 2 RSS
σ |y ∼ InvGamma
2
, . (10.5)
2 2
v. From (ii) and (iv) it follows that (β, σ 2 )|y ∼ NIG(n − 2, RSS, β, s−1
xx ).
Applying Lemma 6.5 we get
RSS
β ∼ t1 (n − 2, β, σ 2 s−1 2
xx ), with σ = .
n−4
3. Simple linear regression with variance as parameter of interest:
(a) In case of known α and β it is an i.i.d. sample problem with i = yi −
α − βxi . Hence,
  n2
1  2
n
1 1
π(σ |y, β, α) ∝ 2
2
exp − .
σ σ2 2σ 2 i=1 i
n
This is the kernel of InvGamma( n2 , 2b ) with b = i=1 (yi − α − βxi )2 .
(b) The posteriors are different. In Problem 2 (b) the regression parameters
are estimated and the degree of freedom is reduced by the number of
estimated parameters.
4. Transformation of simple linear regression model to a centered model : We
eliminate the intercept α, by
ȳ = α + β x̄ + ε̄, yi − ȳ = α − α + β(xi − x̄) + εi − ε̄.
n
(a) The transformation is given by ξi = εi − n1 j=1 εj . Set 1n = (1, . . . , 1)T
the n-dimensional column vector of ones. Then
1 1
ξ = ε − 1n 1Tn ε = Pε, P = In − 1n 1Tn
n n
where P is a projection matrix.
(b) Applying the projection properties we obtain a singular covariance ma-
trix
Cov(ξ) = PCov(ε)PT = σ 2 P, det(P) = 0.
(c) Deleting the last observation:
1 1
ξ(−n) = ε(−n) − 1n−1 1Tn ε = Aε(−n) − εn 1n−1
n n
with A = In−1 − n1 1n−1 1Tn−1 , where A is not a projection matrix. Further
1
Cov(ξ(−n) ) = σ 2 (AA + 1n−1 1Tn−1 ) = σ 2 Σ
n2
with Σ = In−1 − n1 1n−1 1Tn−1 . Let e be an eigenvector orthogonal to 1n−1 .
Then Σe = e, with corresponding eigenvalue 1. Consider the eigenvector
v = a1n−1 . Then Σv = n−1 v and the eigenvalue is n1 = 0. Thus det(Σ) =
1
n > 0.
(d) Consider model (6.159) with deleted last observation and ξ(−n) ∼
N(0, σ 2 Σ), where
1
Σ = In−1 − 1n−1 1Tn−1 , Σ−1 = In−1 + 1n−1 1Tn−1
n
(x(−n) − x̄1n−1 )T Σ−1 (x(−n) − x̄1n−1 ) = sxx
(x(−n) − x̄1n−1 )T Σ−1 (y(−n) − ȳ1n−1 ) = sxy
i. Model (6.159) with deleted last observation and Jeffreys prior: Apply-
ing Theorem 6.7 and using the expressions above we have β|y, σ 2 ∼
sxy σ 2  
N1 sxx , sxx and σ 2 |y ∼ InvGamma n−2 RSS
2 , 2 , which is the same
result as in model (10.5).
ii. Model (6.159) with deleted last observation and conjugate prior,
σ 2 = 1: Applying Corollary 6.1 and using the expressions above, we
λ s +γb
have β|y ∼ N1 (μ1 , σ02 ) with μ1 = λ22 sxy
xx +1
and σ02 = λ2 sλxx2 +1 . This
posterior is different from the posterior (6.71), given in Example 6.9
on page 151.
5. Precision parameter : The conjugate family includes the normal-gamma dis-
tributions, defined as follows. A vector valued random variable X with sam-
ple space X ⊆ Rp and a positive random scalar λ have a normal-gamma
distribution
(X, λ) ∼ NGam(a, b, μ, P)
iff  
−1 −1 a b
X|λ ∼ Np (μ, λ P ), λ ∼ Gamma , .
2 2
The kernel of the density of NGam(a, b, μ, P) is given as
 
a+p−2 λ
λ 2 exp − (b + (X − μ) P(X − μ)) .
T
2
−2
Set λ = σ . Suppose that θ = (β, λ) ∼ NGam(a0 , b0 , γ0 , P0 ). Then
 
a0 +p−2 λ
π(θ|y) ∝ λ 2 exp − (b0 + (β − γ0 )T P0 (β − γ0 ))
2
 
n λ
× λ 2 exp − (y − Xβ)T (y − Xβ) .
2
Applying (6.25) given in Lemma 6.4 on page 138 we obtain θ|y ∼
NGam(a1 , b1 , γ1 , P1 ) where a1 = a0 + n,
b1 = b0 + (y − Xγ1 )T (y − Xγ1 ) + (γ0 − γ1 )T P0 (γ0 − γ1 ),
P1 = P0 + XT X, γ1 = P−1 T
1 (P0 γ + X y).
6. Jeffreys prior for the scale parameter : Set η T = (η1T , η2 ) = (β T , σ) and
θT = (θ1T , θ2 ) = (β T , σ 2 ). From Theorem 6.6 we know that
⎛ ⎞
1 T
X X 0
I(θ) = ⎝ θ2 ⎠.
n
0 2θ 2
2

Using transformation θ1 = h1 (η) = η1 , θ2 = h2 (η) = η22 , we get the Jaco-


bian matrix ⎛ ⎞
Ip 0
J=⎝ ⎠.
0 2η2
Applying (3.31), with I(η) = JT I(θ)J at θ = h(η), we obtain
⎛ ⎞⎛ ⎞⎛ ⎞ ⎛ ⎞
1 T T
Ip 0 X X 0 I 0 1 X X 0
I(η) = ⎝ ⎠ ⎝ η2 ⎠⎝ ⎠= ⎝ ⎠.
2 p
n η 2
0 2η2 0 2η 4
0 2η 2 2 0 2n
2

−(p+1)
Using Definition 3.4 we obtain π(η) ∝ η2 . Under the independence
assumption we calculate separately π(η1 ) ∝ const, and π(η2 ) ∝ η2−1 , so
that π(η) ∝ η2−1 .
7. Three sample problem:
(a) Reformulation as linear model is given by
⎛ ⎞ ⎛ ⎞⎛ ⎞
XT 1m 0 0 1 0 ⎛ ⎞
⎜ ⎟ ⎜ ⎟⎜ ⎟ μ
⎜ ⎟ ⎜ ⎟⎜ ⎟
1 ⎟⎝ ⎠ + ,
1
y = ⎜ Y T ⎟ = ⎜ 0 1m 0 ⎟ ⎜ 0
⎝ ⎠ ⎝ ⎠⎝ ⎠ μ2
ZT 0 0 1m −1 −1

with ∼ N3m (0, σ 2 Σ)


⎛ ⎞ ⎛ ⎞
λ I 0 0 1m 0
⎜ 1 m ⎟ ⎜ ⎟
⎜ ⎟ ⎜ ⎟
Σ=⎜ 0 λ 2 Im 0 ⎟ and X = ⎜ 0 1m ⎟ .
⎝ ⎠ ⎝ ⎠
0 0 λ 3 Im −1m −1m
(b) Applying Theorem 6.4 we obtain the posterior
⎛ ⎞
μ 1
⎝ 1 ⎠ |σ 2 , X, Y, Z ∼ N2 (γ1 , σ 2 Γ1 ), d = ,
μ2 n((2 + n1 )2 − 1)
⎛ ⎞ ⎛ ⎞
2 + n1 −1 (2 + n1 )(x̄ − z̄) − (ȳ − z̄)
Γ1 = d ⎝ ⎠ , γ1 = d ⎝ ⎠.
−1 2 + n1 (2 + n1 )(ȳ − z̄) − (x̄ − z̄)
(c) Posterior based on two samples: Applying Theorem 6.3 with τ 2 = σ 2 we
obtain
⎛ ⎞ ⎛⎛ ⎞ ⎛ ⎞⎞
nȲ 1
μ1 0
⎝ ⎠ |σ 2 , X, Y ∼ N2 ⎝⎝ n+1 ⎠ , σ 2 ⎝ n+1 ⎠⎠ .
nX̄ 1
μ2 n+1 0 n+1

(d) Comparison of posteriors covariance matrices: The posterior based on


three samples has a smaller covariance matrix. It holds that I2 − (n +
1)Γ1 = n+1
d D with
⎛ ⎞
−(1 + n1 ) 1
D=⎝ ⎠ , det(D) = (1 + 1 )2 − 1 > 0.
1 −(1 + 1 ) n
n

8. Parallel regression lines:


(a) Set x = (x1 , . . . , xn )T , y = (y1 , . . . , yn )T , z = (z1 , . . . , zn )T , ε =
(ε1 , . . . , εn )T , and ξ = (ξ1 , . . . , ξn )T . Reformulation of (6.161) as linear
model is given by
⎛ ⎞
⎛ ⎞ ⎛ ⎞ α ⎛ ⎞
y 1n 0 x ⎜ ⎟ ε
y=⎝ ⎠=⎝ ⎠⎜ ⎟
⎜γ ⎟ + ⎝ ⎠ ,
z 0 1n x ⎝ ⎠ ξ
β

with error distribution N2n (0, σ 2 I2n ).


n n n
(b) Set ny = i=1 yi , nz = i=1 zi , and nx2 = i=1 x2i . We have
⎛ ⎞ ⎛ ⎞
n 0 0 ny
⎜ ⎟ ⎜ ⎟
⎜ ⎟ ⎜ ⎟
XT Σ−1 X = ⎜ 0 n T −1
0 ⎟ and X Σ y = ⎜ nz ⎟.
⎝ ⎠ ⎝ ⎠
T
0 0 2nx2 x (y + z)

Applying Corollary 6.1


⎛ ⎞ ⎛⎛ ⎞ ⎛ ⎞⎞
n 1
α y 0 0
⎜ ⎟ ⎜⎜ n+λ ⎟ ⎜ n+λ ⎟⎟
⎜ ⎟ 2 ⎜⎜ ⎟ 2⎜ ⎟⎟
⎜ γ ⎟ |σ , y ∼ N3 ⎜⎜ n+λ n
z ⎟, σ ⎜ 0 1
0 ⎟⎟ ,
⎝ ⎠ ⎝⎝ ⎠ ⎝ n+λ ⎠⎠
β d xT (y + z) 0 0 d
so that
β|σ 2 , y ∼ N(dxT (y + z), σ 2 d) with d = (2nx2 + λ)−1 .
(c) Applying Theorem 6.5, especially (6.59), (6.66), (6.69) gives σ 2 |y ∼
b1
InvGamma( a+2n
2 , 2 ) with
1 1
b1 = b + y T y + z T z − (ny)2 − (nz)2 − d(xT (y + z))2 .
n+λ n+λ
(d) The results above give (β, σ 2 )|y ∼ NIG(a + 2n, b1 , dxT (y + z), d). Lemma
6.5 delivers
b1
β|y ∼ t(a + 2n, d xT (y + z), d).
a + 2n
n n
9. Linear mixed model : Set nxy = i=1 xi yi , nxz = i=1 xi zi . A conjugate
prior is N(β0 , σ02 ). Applying Theorem 6.10 with σ 2 = 1 gives
⎛ ⎞ ⎛ ⎞ ⎛ ⎞
β n+1 −nxz m
⎝ ⎠ |y ∼ N2 (α1 , Ψ1 ), Ψ1 = d ⎝ ⎠ , α1 = d ⎝ 1 ⎠
γ −nxz n + σ0−2 m2
with
 −1
d = (n + 1)(n + σ0−2 ) − n2 xz 2 , m1 = (n + 1)(xz + σ0−2 β0 ) − n2 xz xy.
Hence β|y ∼ N(d m1 , d (n + 1)).
10. Two correlated lines: Set x = (x1 , . . . , xn )T , y = (y1 , . . . , yn )T , z =
(z1 , . . . , zn )T , ε = (ε1 , . . . , εn )T , and ξ = (ξ1 , . . . , ξn )T .
(a) Reformulation of (6.162) as multivariate model is given by
⎛ ⎞
αy αz
Y = y z = XB + ε ξ , with X = 1n x , B = ⎝ ⎠,
βy βz
(10.6)
where
⎛ ⎞
σ12 σ12
Y|θ ∼ MNn,2 (XB, In , Σ) with Σ = ⎝ ⎠.
σ12 σ22
n n
(b) Set nxy = i=1 xi yi , nxz = i=1 xi zi . Applying Theorem 6.11 gives
the posterior NIW(ν1 , B1 , C1 , Σ1 ) with ν1 = ν + n,
⎛ ⎞ ⎛ ⎞ ⎛ ⎞
c1 c1 c1
α̃y α̃z y z 0
B1 = ⎝ ⎠ = ⎝ 1+nc1 1+nc1 ⎠
, C1 = ⎝ 1+c1 n ⎠.
c2 c2 c2
β̃y β̃z 1+nc2 xy 1+nc2 xz 0 1+c2 n
(10.7)
Further E(Σ|Y) = ν11−3 Σ1 . Set ỹi = α̃y + β̃y xi and z̃i = α̃z + β̃z xi . Then

1 
n
1 1
E(σ12 |Y) = (yi − ỹi )(zi − z̃i ) + α̃y α̃y + β̃y β̃z .
ν+n−3 i=1
c1 c2
10.6 Solutions for Chapter 7
1. Sample from Gamma(ν, θ): Likelihood and prior, θ ∼ Gamma(α, β), are
given by

n
(θ|x) ∝ θ nν
exp(− xi θ), π(θ) ∝ θα−1 exp(−βθ).
i=1

The posterior, π(θ|x) ∝ π(θ)(θ|x), is



n
π(θ|x) ∝ θnν+α−1 exp(−( xi + β)θ),
i=1
n
i.e., θ|x ∼ Gamma(nν + α, i=1 xi + β).
(a) Applying (7.20)
nν + α
θL2 (x) = E(θ|x) = n .
i=1 xi + β

(b) Applying mode’s formula


nν + α − 1
θMAP (x) = mode(θ|x) = n .
i=1 xi + β
n
(c) Set F −1 for the quantile function of Gamma(nν + α, i=1 xi + β).
i. For each a ∈ [0, α] the quantiles qlow (a) = F −1 (a) and qupp (a) =
F −1 (a + 1 − α) give α–credible interval [qlow (a), qupp (a)].
ii. Find a∗ such that qupp (a) − qlow (a) is minimal.
iii. The HPD interval is given by [qlow (a∗ ), qupp (a∗ )].
2. Sample from N(μ, σ 2 ), θ = (μ, σ 2 ): Let θ ∼ NIG(a, b, γ, λ) be conjugate
prior. Then θ|x ∼ NIG(a1 , b1 , γ1 , λ1 ), as given in Example 7.3.
n n
(a) Set sxx = i=1 (xi − x)2 and nx = i=1 xi . The likelihood function is
 
2 −n 1
(θ|x) ∝ (σ ) 2 exp − 2 (sxx + n(μ − x) ) 2

and μMLE = arg maxμ (μ, σ 2 |x) = x. Further


 
1 sxx 1
2
σMLE = arg max (x, σ 2
|x) = arg max exp − = sxx .
σ2 σ2 σ 2n 2σ 2 n
n
(b) Set zi = xi − μ. Then x − μ = z, szz = i=1 (zi − z)2 , and Ezi = 0,
where
1 n
2
Cov(μMLE , σMLE |θ) = E z (zi − z)2 = 0,
n i=1

since E(zi zj zk ) = 0 for all i, j, k = 1, . . . , n, Ez 3 = 0, and E(zzi2 ) = 0.


(c) θ|x ∼ NIG(a1 , b1 , γ1 , λ1 ) implies μ|x, σ ∼ N(γ1 , σ λ1 ). The mode of a
2 2

normal distribution is the expectation, thus


1
μMAP = γ1 = (λγ + nx).
n+λ
Further σ 2 |x ∼ InvGamma( a21 , b21 ) and the mode formula gives

2 b1 n
σMAP = , b1 = b + sxx + x̄2 .
a1 + 2 n+1
n 2 2 λ2
Set 2
i=1 xi = nx and use sxx + nx = nx .
2 We have for c0 = b + λ12 γ 2 ,
2
λ
c1 = −2 λ12 γ, c2 = −λ21 n2 , and c3 = n, that

2 1
σMAP = (c0 + c1 x + c2 x2 + c3 x2 ).
a+n+2
(d) Set μ = 0 and σ 2 = 1. Then X1 , . . . , Xn i.i.d. N(0, 1), with Ex = 0,
Ex3 = 0, Ex2 x = 0, and Ex2 = n1 . For γ = 0
 
nx 2 λ+n
2
Cov(μMAP , σMAP ) = E σ = −2γ 2 = 0.
a + λ MAP λ (a + n + 2)
3. Simple linear regression: Applying (6.64) and (6.62) we get θ|y ∼
N2 (γ1 , σ 2 Γ1 ), where γ1 = (α̃, β̃)T and
⎛ ⎞
λ1
0
Γ1 = ⎝ 1+λ1 n ⎠ , α̃ = λ1 ny + γa , β̃ = λ2 nxy + γb .
0 λ2 1 + nλ1 1 + λ2 sxx
1+λ2 sxx

(a) We have μ(z)|y ∼ N(μ̃(z), σ 2 s2 (z)) with μ̃(z) = α̃ + β̃z and


λ1 z 2 λ2
s2 (z) = + .
1 + λ1 n 1 + λ2 sxx
The HPD α0 -credible interval is given by
[μ̃(z) − z(1− α20 ) σs(z), μ̃(z) + z(1− α20 ) σs(z)],

where z(1− α20 ) is the (1 − α20 )–quantile of N(0, 1), i.e., Φ(zγ ) = γ.
(b) Applying Lemma 6.2 on page 132 to
⎛⎛ ⎞ ⎛ ⎞⎞
⎛ ⎞ 1 z s2 (z) + 1 (1, z)Γ1
y ⎜⎜ ⎟ ⎜ ⎛ ⎞ ⎟⎟
⎝ f ⎠ |y ∼ N3 ⎜ ⎜ ⎟ ⎜
⎜⎜1 0⎟ γ1 , σ 2 ⎜ T 1
⎟⎟
⎟⎟
θ ⎝ ⎝ ⎠ ⎝ Γ1
⎝ ⎠ Γ 1
⎠⎠
0 1 z
gives
 
yf |y ∼ N μ̃(z), σ 2 (s2 (z) + 1) . The prediction interval is given by
 
[μ̃(z) − z(1− α0 ) σ s2 (z) + 1, μ̃(z) + z(1− α20 ) σ s2 (z) + 1].
2
4. Predictive distribution in binomial model : In Example 2.11 the posterior is
calculated as θ|X ∼ Beta(α, β) with α = α0 + x and β = β0 + n − x.
(a) Applying (7.33), the predictive distribution,
 1 
nf x f 1
π(xf |x) = θ (1 − θ)(nf −xf ) θα−1 (1 − θ)β−1 dθ
0 xf B(α, β)
   1
nf 1
= θxf +α−1 (1 − θ)(nf −xf +β−1) dθ
xf B(α, β) 0
 
nf B(α + xf , β + n − xf )
= ,
xf B(α, β)
is a beta-binomial distribution,

xf |x ∼ BetaBin(nf , α0 + x, β0 + n − x).

(b) A possible R code:


library(VGAM); k<-0:5; dbetabinom.ab(k,1,4,3)
5. Two independent lines, interested in averaged slope:
(a) Set x = (x1 , . . . , xn )T , y = (y1 , . . . , yn )T , z = (z1 , . . . , zn )T , ε =
(ε1 , . . . , εn )T , and ξ = (ξ1 , . . . , ξn )T . Reformulation of (7.70) as linear
model is given by
⎛ ⎞
α
⎛ ⎞ ⎛ ⎞ ⎜ y⎟ ⎛ ⎞
⎜ ⎟
y 1n x 0 0 ⎜ βy ⎟ ε
y=⎝ ⎠=⎝ ⎠⎜ ⎟ + ⎝ ⎠, (10.8)
⎜ ⎟
z 0 0 1n x ⎜ α z ⎟ ξ
⎝ ⎠
βz

with error distribution N2n (0, σ 2 I2n ).


(b) We have

XT Σ−1 X = nI4 and XT Σ−1 y = n(y, xy, z, xz)T .

Applying Corollary 6.1, we get


⎛ ⎞ ⎛⎛ ⎞ ⎛ ⎞⎞
λ1
αy α̃y 0 0 0
⎜ ⎟ ⎜⎜ ⎟ ⎜ 1+nλ1 ⎟⎟
⎜ ⎟ ⎜⎜ ˜ ⎟ ⎜ ⎟⎟
⎜ βy ⎟ 2 ⎜⎜ β ⎟ ⎜ 0 λ2
0 0 ⎟⎟
⎜ ⎟ |σ , y ∼ N4 ⎜⎜ y ⎟ , σ 2 ⎜ 1+nλ2 ⎟⎟ ,
⎜ ⎟ ⎜⎜ ⎟ ⎜ ⎟⎟
⎜ αz ⎟ ⎜⎜α̃z ⎟ ⎜ 0 0 λ3 ⎟⎟
⎝ ⎠ ⎝⎝ ⎠ ⎝ 1+nλ3 ⎠⎠
λ4
βz β̃z 0 0 0 1+nλ4

with
nλ1 y + m1 nλ2 xy + m2
α̃y = , β̃y = ,
1 + nλ1 1 + nλ2
nλ3 z + m3 nλ4 xz + m4
α̃z = , β̃z = .
1 + nλ3 1 + nλ4
Then η|σ 2 , y ∼ N(η̃, σ 2 s2y ) with
 
1 1 λ2 λ4
η̃ = (β̃y + β̃z ), s2y = + .
2 4 1 + nλ2 1 + nλ4

Further, σ 2 |y ∼ InvGamma((a+2n)/2, b1 (y)/2) with b1 (y) given in (6.67)


as

n 
n
b1 (y) = b + (yi − ỹi )2 + (zi − z̃i )2
i=1 i=1
1 1
+ (m1 − α̃y )2 + (m2 − β̃y )2
λ1 λ2
1 1
+ (m3 − α̃z )2 + (m1 − β̃z )2
λ3 λ4

with ỹi = α̃y + β̃y xi and z̃i = α̃z + β̃z xi . From Lemma 6.5 it follows that

b1 (y) 2
η|y ∼ t(a + 2n, η̃, s ).
a + 2n y

(c) For known σ 2 the posterior is η|σ 2 , y ∼ N(η̃, σ 2 s2y ). Thus

C1 (y, σ 2 ) = {η : η̃ − z(1− α2 ) σsy ≤ η ≤ η̃ + z(1− α2 ) σsy }. (10.9)

(d) The Bayes L2 estimate of σ 2 is the expected value of InvGamma((a +


2n)/2, b1 (y)/2). Thus
b1 (y)
σ̃12 = .
a + 2n − 2
(e) The HPD α-credible interval is given as

C1 (y) = {η : η̃ −t(a+2n,1− α2 ) s1 (y) ≤ η ≤ η̃ +t(a+2n,1− α2 ) s1 (y)}, (10.10)

with  
2 b1 (y) 2 b1 (y) 1 λ2 λ4
s1 (y) = s = +
a + 2n y a + 2n 4 1 + nλ2 1 + nλ4
where t(df,γ) is the γ-quantile of t1 (df, 0, 1).
6. Averaged observations: Model (7.71) is a simple linear regression model
with Σ = 12 In .
(a) Applying Theorem 6.5 and Lemma 6.5,
 
c2
βu |u, σ 2 ∼ N β̃u , σ 2 , σ 2 |u ∼ InvGamma((a+n)/2, b1 (u)/2),
1 + 2nc2
b1 (u)c2
βu |u ∼ t a + n, β̃u , s2 (u)2 , with s2 (u)2 = .
(a + n)(1 + 2nc2 )
Further ũi = α̃u + β̃u xi , where
2nc1 u 2nc2 xu
α̃u = , β̃u = ,
1 + 2nc1 1 + 2nc2

n
1 2 1
b1 (u) = b + 2 (ui − ũi )2 + α̃u + β̃u2 .
i=1
c1 c2

(b) For known σ the posterior is η|(σ 2 , y) ∼ N(β̃u , σ 2 s2u ) with s2u =
2 c2
1+2nc2 .
Thus

C2 (u, σ 2 ) = {η : η̃ − z(1− α2 ) σsu ≤ η ≤ η̃ + z(1− α2 ) σsu }.

(c) The Bayes L2 estimate of σ 2 is the expected value of InvGamma((a +


n)/2, b1 (u)/2). Thus
b1 (u)
σ̃22 = .
a+n−2
(d) The HPD α-credible interval for η = βu is

C2 (u) = {η : β̃u − t(a+n,1− α2 ) s2 (u) ≤ η ≤ β̃u + t(a+n,1− α2 ) s2 (u)},


(10.11)
where t(df,γ) is the γ-quantile of students t1 (df, 0, 1) distribution with df
degrees of freedom.
7. Comparing the models in the two previous problems:
(a) The prior of (η, σ 2 ) in model (7.70) is NIG(a, b, 12 (m2 + m4 ), 14 (λ2 + λ4 )).
The prior of (η, σ 2 ) in model (7.71) is NIG(a, b, 0, c2 ). Set m2 = m4 = 0
and λ2 = λ4 = 2c2 . Then (η, σ 2 ) has the prior NIG(a, b, 0, c2 ) in both
cases.
(b) For known σ 2 the HPD α-credible intervals for η coincide, since

c2 
n
1 nc2 1
β̃u = 2 (yi + zi )xi = (xy + xz) = (β̃y + β̃z ) = η̃.
1 + 2nc2 i=1 2 1 + 2nc2 2
 
1 λ2 λ4 1 2c2
s2y = + = = s2u .
4 1 + nλ2 1 + nλ4 2 1 + 2nc2
This is expected, because u is a sufficient statistic for η. The orthog-
onal design and the diagonal prior covariance matrices imply that the
intercepts have no influence on η in both models.
(c) Note that, u is not sufficient for σ 2 . Assume additionally that m1 =
m3 = 0 and λ1 = λ3 = 2c1 , λ2 = λ4 = 2c2 . Then

2c1  1
n
nc1 1
α̃u = (yi + zi ) = (y + z) = (α̃y + α̃z )
1 + 2nc1 i=1 2 1 + 2nc1 2
n
and ũi = α̃u + β̃u xi 
 = (ỹi + z̃i ). Set Ry = i=1 (yi − ỹi ) , Rz =
1
2
2
n n n
i=1 (zi − z̃i ) 2
, R u = i=1 (ui − ũi ) and Rzy =
2
i=1 (zi − z̃i )(yi − ỹi ).
1
Thus 2Ru = 2 (Ry + Rz ) + Rzy . The estimates σ̃1 and σ̃22 are different,
2

mainly because b1 (y) and b1 (u) are different, i.e.,


1 1
b1 (y) = b + Ry + Rz + (α̃y2 + α̃z2 ) + (β̃ 2 + β̃z2 )
2c2 2c1 y
1 1 1
b1 (u) = b + (Ry + Rz ) + Rzu + (α̃y + α̃z )2 + (β̃y + β̃z )2 .
2 4c1 4c2
(d) Comparing the intervals (10.10) and (10.11): The center of the inter-
vals coincides. But s1 (y) and s2 (u) are different. Especially the degrees
of freedom in student’s distributions are different. In model (7.70) 2n
observations are used, where in model (7.71) only n observations are
used.
8. Two dependent lines, interested in averaged slope: The corresponding mul-
tivariate regression model is defined in (10.6). The posterior distribution is
θ ∼ NIW(ν1 , B1 , C1 , Σ1 ) with ν1 = ν + n, B1 , and C1 given in (10.7) and,
using the notations Ry , Rz , Ryz above,
⎛ ⎞
σ̃12 σ̃12
Σ1 = ⎝ ⎠
σ̃21 σ̃22
with
1 2 1 1 1
σ̃12 = 1 + Ry + α̃ + β̃ 2 , σ̃22 = 1 + Rz + α̃z2 + β̃z2
λ1 y λ2 y λ1 λ2
1 1
σ̃12 = σ̃21 = Ryz + α̃y α̃z + β̃y β̃z .
λ1 λ2
(a) The Bayes estimates of βy and βz are β̃y and β̃z , respectively. The pa-
rameter of interest is linear in βy and βz , i.e., η̃ = 12 (β̃y + β̃z ). Note that
estimates η̃ in model (7.70) with prior NIG(a, b, 0, diag(λ1 , λ2 , λ1 , λ2 ))
and in model (7.71) with prior NIG(a, b, 0, 12 diag(λ1 , λ2 )) coincide.
(b) Set Σ as known. Applying Theorem 6.11 on page 176 gives B|Y, Σ ∼
MN2,2 (B1 , C1 , Σ). The parameter of interest is η = ABC, where A =
1 T
2 (0, 1), C = (1, 1) . Using (7.73), we obtain
 
λ2
η ∼ MN1,1 η̃, , σ12 + σ22 + 2σ12 ≡ N(η̃, s2 )
4(1 + nλ2 )
with
λ2 (σ12 + σ22 + 2σ12 )
s2 = .
4(1 + nλ2 )
Thus
C3 (Y, Σ) = {η : η̃ − z(1− α2 ) s ≤ η ≤ η̃ + z(1− α2 ) s}.
Under σ1 = σ2 = σ, σ12 = 0, and λ2 = λ4 , we have s2 = s2y σ 2 and
C3 (Y, Σ) = C1 (y, σ 2 ) given in (10.9).
(c) Applying Theorem 6.11 on page 176 gives

B|Y ∼ t2,2 (ν + n − 1, B1 , C1 , Σ1 ).

Using (7.74) and (6.148) we get


λ2
η|Y ∼ t1,1 (ν + n − 1, η̃, , 1T Σ1 12 ) ≡ t(ν + n − 1, η̃, s(Y)2 ),
4(1 + nλ2 ) 2
1 λ2
with s(Y)2 = T
ν+n−1 4(1+nλ2 ) 12 Σ1 12 where

1 1
1T2 Σ1 12 = 2 + Ry + 2Ryz + Rz + (α̃y + α̃z )2 + (β̃y + β̃z )2 .
λ1 λ2

(d) The HPD α-credible interval for η is

{η : η̃ − t(ν+n−1,1− α2 ) s(Y) ≤ η ≤ η̃ + t(ν+n−1,1− α2 ) s(Y)}, (10.12)

where t(df,γ) is the γ-quantile of t1 (df, 0, 1) distribution with df degrees


of freedom.

10.7 Solutions for Chapter 8


1. Poisson distribution:
n
(a) Set θ ∼ Gamma(α, β). Then θ|x ∼ Gamma(α + i=1 xi , β + n), since
n  xi 
θ
π(θ)(θ|x) ∝ θα−1 exp(−θβ) exp(−θ)
i=1
xi !
n
∝ θα−1+ i=1 xi
exp(−θ(β + n)).

(b) Applying (8.4) we obtain



⎨ 1 if Pπ (θ ≥ 1|x) < 0.5
ϕ(x) = .
⎩ 0 if Pπ (θ ≥ 1|x) ≥ 0.5

Since the median of a gamma distribution has no closed form, we cannot


simplify this Bayes rule.
(c) The posterior is Gamma(17, 21) and Pπ (θ ≥ 1|x) = 0.163. We reject H0 .
2. Gamma distribution:
n
(a) θ|x ∼ Gamma(α1 , β1 ) with α1 = α0 + nα and β1 = β0 + i=1 xi ; see
Problem 1 in Chapter 7.
(b) The Bayes factor is defined in (8.18). We have

θαn  α−1 
n n
p(x|θ) = x exp(−θ xi ).
Γ(α)n i=1 i i=1
∞
Using the integral 0 xa−1 exp(−bx) dx = b−a Γ(a), we obtain

n
m(x) = Γ(α)−n Γ(α0 )−1 β0α0 xiα−1 Γ(α1 )β1−α1 .
i=1

Hence
Γ(α0 ) θ0αn n n
α0 +nα
B01 = (β 0 + x i ) exp(−θ 0 xi ).
Γ(α0 + nα) β0α0 i=1 i=1

(c) Applying the table from Kass and Vos (1997), we get very strong evi-
dence against the null hypothesis.
3. Two sample problem, Delphin data:
(a) In both models the parameter θ is related to location parameter. The
non-informative prior is Jeffreys, π(θ) ∝ const; see Example 3.21.
(b) First we consider the model {P0 , π}. Only the distribution of the first m
observations depends of θ. The posterior is N(γ(0) , Γ(0) ) with

1 
m
1 2
γ(0) = x̄(1) , Γ(0) = σ , x̄(1) = xi .
m 1 m i=1

The second model {P1 , π} is the special case of a linear model. We apply
Theorem 6.7 on page 157 , where p = 1, σ 2 = 1, X = (1, . . . , 1, a, . . . , a)T
2 2 2 2 (2) 1 n
and Σ = diag(σ1 , . . . , σ1 , σ2 , . . . , σ2 ). Set x̄ = n−m i=m+1 xi . We
obtain the posterior N(γ(1) , Γ(1) ) with

σ1−2 m x̄(1) + σ2−2 (n − m)ax̄(2)


γ(1) = ,
σ1−2 m + σ2−2 (n − m)a2

Γ(1) = (σ1−2 m + σ2−2 (n − m)a2 )−1 .


(c) In the first model the fitted values are

1 
m
(0) (0)
xi = xi = x̄(1) , i = 1, . . . , m, xi = μ, i = m + 1, . . . , n,
m i=1

with
1  1 
m n
RSS(0) = 2 (xi − x̄(1) )2 + 2 (xi − μ)2 .
σ1 i=1 σ2 i=m+1
In the second model we have
(1) (1)
xi = γ (1) , i = 1, . . . , m, xi = aγ (1) , i = m + 1, . . . , n,

with
1  1 
m n
RSS(1) = (x i − γ (1) ) 2
+ (xi − aγ(1) )2 .
σ12 i=1 σ22 i=m+1
(d) The Bayes factor is
 ∞
m0 (x)
B01 = , mj (x) = pj (x|θ) dθ, j = 0, 1.
m1 (x) −∞

We have n
m0 (x) exp(− 2σ1 2 i=m+1 (xi − μ)2 )A1
2
=
m1 (x) A2
with 
1 
m
A1 = exp − (xi − θ)2 dθ
2σ12 i=1
and

1  1 
m n
A2 = exp − (x i − θ) 2
− (xi − aθ)2 dθ.
2σ12 i=1 2σ22 i=m+1
 
First consider A1 . Applying
√ (xi − θ)2 = (xi − x̄)2 + m(θ − x̄)2 and
1
exp(− 2a (x − b)2 )dx = 2πa, we get

√ 1 
m
σ12 1
A1 = 2π( ) 2 exp − 2 (xi − x̄(1) )2 .
m 2σ1 i=1

For A2 , applying (6.86), we have

1  1 
m n
(x i − θ) 2
+ (xi − aθ)2 = RSS(1) + Γ−1
(1) (θ − γ(1) ) .
2
σ12 i=1 σ22 i=m+1

Thus √ 1
A2 = 2π(Γ(1) ) 2 exp −RSS(1) .

Summarizing,
 
n − m σ12 2 1 1
B01 = (1 + a ) 2 exp (RSS(1) − RSS(0) ) .
n σ22 2

(e) We obtain RSS(0) = 45.6062, RSS(1) = 32.7241, 2 ln(B10 ) = 22.98. The


evidence against H0 is very strong. We prefer the second model.
4. Corona example: Set yT = (y11 , . . . , y1n , y21 , . . . y2n ) and xk =
(xk,1 , . . . , xk,n , xk,n+1 , . . . , xk,2n )T , k = 1, . . . , 4. Then we have a linear
model with design matrix
⎛⎛ ⎞ ⎛ ⎞ ⎞
1n 0n
X = ⎝⎝ ⎠ ⎝ ⎠ x1 x2 x3 x4 ⎠ .
0n 1n
(a) We apply Corollary 6.1 on page 135 and obtain θ|y ∼ N(θ, Γ) with
θ = (η1 , η2 , β1 , . . . , β4 ) = σ12 ΓXT y and Γ = (I6 + σ12 XT X)−1 .
(b) Set h = (1, −1, 0, 0, 0, 0)T and ση2 = hT Γh. Then

(η1 − η2 )|y ∼ N(η1 − η2 , ση2 ).

(c) Reject H0 iff
   
0.2 − (η1 − η2 ) −0.2 − (η1 − η2 )
Φ −Φ ≤ 0.5,
ση ση

where Φ is the distribution function of N(0, 1).


5. Test of parallelism:
(a) See the notation and solution of Problem 8 of Chapter 6. We apply
(0)
Theorem 6.5 on page 150 and obtain the posterior NIG(a1 , b1 , γ (0) , Γ(0) )
with a1 = a + n,
⎛ ⎞ ⎛ ⎞ ⎛ ⎞
1 n
0 0 y α
⎜ n+1 ⎟ ⎜ n+1 ⎟ ⎜ y⎟
⎜ ⎟ ⎜ ⎟ ⎜ ⎟
Γ =⎜ 0
(0) 1
0 ⎟, γ = ⎜
(0) n
n+1 z
⎟ = ⎜ αz ⎟
⎝ n+1 ⎠ ⎝ ⎠ ⎝ ⎠
1 1 T
0 0 2n+1 2n+1 x (y + z) β

and

(0)

n
(0)

n
(0)
b1 = b + (yi − yi )2 + (zi − zi )2 + (γ (0) )T γ (0) (10.13)
i=1 i=1

(0) (0)
with yi = αy + βxi , zi = αz + βxi . Setting

n 
n
R1 = (yi − αy )2 , R2 = (zi − αz )2 ,
i=1 i=1

we obtain
(0)
b1 = b + R1 + R2 + (αy )2 + (αz )2 − (2n + 1)(β)2 . (10.14)

(b) See the notation and solution of Problem 5(a), (b) of Chapter 6. We
(1)
apply Theorem 6.5 and obtain the posterior NIG(a1 , b1 , γ (1) , Γ(1) ) with
a1 = a + n,
⎛ ⎞ ⎛ ⎞
n
y α
⎜ n+1 ⎟ ⎜ y ⎟
⎜ n ⎟ ⎜ ⎟
1 ⎜ n+1 z ⎟ ⎜αz ⎟
Γ(1) = I4 , γ (1) = ⎜ ⎟ ⎜ ⎟
⎜ 1 T ⎟ = ⎜ ⎟.
n+1 ⎜ n+1 x y ⎟ ⎜ βy ⎟
⎝ ⎠ ⎝ ⎠
1 T
n+1 x z βz
Note that, the estimates of αy and of αz coincide in both models, because
of the condition x̄ = 0. Further

(1)

n
(1)

n
(1)
b1 = b + (yi − yi )2 + (zi − zi )2 + (γ (1) )T γ (1) (10.15)
i=1 i=1

(1) (1)
with yi = αy + βy xi , zi = αz + βz xi . We obtain
(1)
b1 = b + R1 + R2 + (αy )2 + (αz )2 − (n + 1)(βy )2 − (n + 1)(βz )2 . (10.16)

n+1
(c) Note that, β = 2n+1 (βy + βz ). Applying (10.14) and (10.16) we get

1 (0) (1) n+1 1 2n + 2


(b1 − b1 ) = ( βy − βz ) 2 − βy β z
n 2n + 1 n 2n + 1
n+1
= (βy − βz )2 + op (1).
2n + 1

(d) We apply (8.20) and get


a+n
(1) 2
n+1 b1
B01 =√ (0)
.
2n + 1 b1

6. Variance test in linear regression:


(a) Under H0 the variance is known. Thus the parameter is θ(0) = β. Jeffreys
prior is π0 (β) ∝ const. Applying Theorem 6.7 on page 157 with Σ = In ,
we get the posterior N(β, σ02 (XT X)−1 ) with β = (XT X)−1 XT y.
(b) Under H1 the variance is unknown. Thus the parameter is θ(1) = (β, σ 2 ).
Applying Theorem 6.7 with Σ = In , we get the posterior
NIG(n − p, RSS, β, (XT X)−1 ).
m0 (y)
(c) In Definition 8.2 the Bayes factor is given as B01 = m1 (y) with

j (θ(j) |y)πj (θ(j) )
mj (y) = j (θ(j) |y)πj (θ(j) ) dθ(j) = , j = 0, 1.
Θj πj (θ(j) |y)

First we calculate m0 (y). Set π0 (θ) = 1. Applying Lemma 6.4 on page


6.4, we obtain
1
0 (θ(0) |y)π0 (θ(0) ) = (2πσ02 )− 2 exp(−
n
RSS) k0 (β|y),
2σ02
with  
1
k0 (β|y) = exp − 2 (β − β)T XT X(β − β) .
2σ0
Since p
π(β|y) = (2πσ02 )− 2 |XT X| 2 k0 (β|y),
1
we get
n−p 1
m0 (y) = (2πσ02 )− |XT X|− 2 exp(−
1
2 RSS).
2σ02
Now we calculate m_1(y). Applying Lemma 6.4 we obtain
π_1(θ^{(1)})\, \ell_1(θ^{(1)}|y) = (2π)^{-\frac{n}{2}}\, k_1(θ^{(1)}|y),
with
k_1(θ^{(1)}|y) = (σ^2)^{-\frac{n+2}{2}} \exp\Big(−\frac{1}{2σ^2}\big(\mathrm{RSS} + (β − \hat{β})^T X^T X (β − \hat{β})\big)\Big).
Since
π(θ^{(1)}|y) = (2π)^{-\frac{p}{2}}\, |X^T X|^{\frac{1}{2}}\, \frac{1}{Γ(\frac{n-p}{2})}\, \Big(\frac{\mathrm{RSS}}{2}\Big)^{\frac{n-p}{2}}\, k_1(θ^{(1)}|y),
we get
m_1(y) = (2π)^{-\frac{n-p}{2}}\, |X^T X|^{-\frac{1}{2}}\, Γ\Big(\frac{n-p}{2}\Big)\, \Big(\frac{\mathrm{RSS}}{2}\Big)^{-\frac{n-p}{2}}.
Summarizing,
B_{01} = Γ\Big(\frac{n-p}{2}\Big)^{-1} \Big(\frac{\mathrm{RSS}}{2σ_0^2}\Big)^{\frac{n-p}{2}} \exp\Big(−\frac{1}{2σ_0^2}\mathrm{RSS}\Big).
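The Bayes factor in (c) can be evaluated on the log scale with a small R helper (a sketch based on the expression above; RSS, n, p and σ_0 are assumed to be given):
logB01.var <- function(RSS, n, p, sigma0) {
  # log(B01) for H0: sigma^2 = sigma0^2; lgamma is used for numerical stability
  -lgamma((n - p)/2) + ((n - p)/2)*log(RSS/(2*sigma0^2)) - RSS/(2*sigma0^2)
}
# 2*log(B10), cf. part (d): 2*(-logB01.var(RSS, n, p, sigma0))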

(d) As B_{10} = B_{01}^{-1} and \hat{σ}^2 = \frac{1}{n-p}\mathrm{RSS}, we obtain
2 \ln(B_{10}) = (n − p)(x − \ln(x)) + \mathrm{rest}(n), \qquad x = \frac{\hat{σ}^2}{σ_0^2},
with
\mathrm{rest}(n) = 2 \ln\Big(Γ\Big(\frac{n-p}{2}\Big)\Big) − (n − p) \ln\Big(\frac{n-p}{2}\Big).
The function f(x) = x − \ln(x) has its minimum at x = 1. Using the Stirling approximation we obtain \lim_{n→∞} \mathrm{rest}(n) = 0.
7. Test on correlation between two regression lines:
(a) Under H0 the lines are independent. We have the univariate model (10.8) given in Problem 5(a), Chapter 7. Using the notation and solution of Problem 5(b), Chapter 7, the posterior is NIG(2 + 2n, b_1, γ, \frac{1}{n+1} I_4) with
γ^T = (\hat{α}_y, \hat{β}_y, \hat{α}_z, \hat{β}_z)^T = \frac{n}{n+1}(\bar{y}, \overline{xy}, \bar{z}, \overline{xz})^T.
Set \hat{y}_i = \hat{α}_y + x_i \hat{β}_y and \hat{z}_i = \hat{α}_z + x_i \hat{β}_z. Then b_1 = b_{1,y} + b_{1,z} with
b_{1,y} = 1 + \sum_{i=1}^{n} (y_i − \hat{y}_i)^2 + \hat{α}_y^2 + \hat{β}_y^2, \qquad b_{1,z} = 1 + \sum_{i=1}^{n} (z_i − \hat{z}_i)^2 + \hat{α}_z^2 + \hat{β}_z^2.
(b) Under H1 the lines are dependent. We have model (10.6)
Y = XB + ε, \qquad ε ∼ MN_{n,2}(0, I_n, Σ).
Re-writing θ in matrix form θ = (B, Σ) and using θ ∼ NIW(2, 0, I_2, I_2), we obtain the posterior NIW(2 + n, B_1, \frac{1}{n+1} I_2, Σ_1) given in (10.7). Note that the matrix B_1 contains the same elements as the vector γ. Further, we have
Σ_1 = \begin{pmatrix} b_{1,y} & b_{1,yz} \\ b_{1,yz} & b_{1,z} \end{pmatrix}, \qquad b_{1,yz} = \sum_{i=1}^{n} (y_i − \hat{y}_i)(z_i − \hat{z}_i) + \hat{α}_y \hat{α}_z + \hat{β}_y \hat{β}_z.

(c) Using the expectations of the corresponding posteriors, we obtain
\hat{Σ} = \begin{pmatrix} \hat{σ}^2 & 0 \\ 0 & \hat{σ}^2 \end{pmatrix}, \quad \text{with } \hat{σ}^2 = \frac{1}{2n} b_1, \qquad \tilde{Σ} = \frac{1}{n−2} Σ_1.
Comparing both estimates we see that \hat{σ}^2 = \frac{1}{2}(\hat{σ}_1^2 + \hat{σ}_2^2).


(d) In Definition 8.2 the Bayes factor is given as
B_{01} = \frac{m_0(y)}{m_1(y)}, \qquad m_0(y) = c_{0,\ell}\,\frac{c_{0,\mathrm{NIG}}}{c_{1,\mathrm{NIG}}}, \qquad m_1(y) = c_{1,\ell}\,\frac{c_{0,\mathrm{NIW}}}{c_{1,\mathrm{NIW}}},
where c_{0,\ell} is the constant related to the likelihood under H0, c_{0,NIG} is related to NIG(2, 2, 0, I_4) and c_{1,NIG} to the posterior NIG(2 + 2n, b_1, \frac{1}{n+1} I_4); and c_{1,\ell} is the constant related to the likelihood under H1, c_{0,NIW} is related to NIW(2, 0, I_2, I_2) and c_{1,NIW} to NIW(2 + n, B_1, \frac{1}{n+1} I_2, Σ_1). Especially,
c_{0,\ell} = (2π)^{−n}, \qquad c_{0,\mathrm{NIG}} = (2π)^{−1}, \qquad c_{1,\mathrm{NIG}} = (2π)^{−1} \frac{(n+1)^2}{n!} \Big(\frac{b_1}{2}\Big)^{n+1},
c_{1,\ell} = c_{0,\ell}, \qquad c_{0,\mathrm{NIW}} = (2^4 π^3)^{−1}, \qquad c_{1,\mathrm{NIW}} = c_{0,\mathrm{NIW}}\, \frac{(n+1)^2}{\sqrt{π}\, Γ(\frac{n+2}{2})\, Γ(\frac{n+1}{2})}\, |Σ_1|^{\frac{n+2}{2}}.
Using the duplication formula, Γ(2z) = (2π)^{−1/2}\, 2^{2z−1/2}\, Γ(z)\, Γ(z + \frac{1}{2}), with 2z = n + 1, we obtain
\frac{\sqrt{2π}\, n!}{2^{\,n+\frac{1}{2}}\, Γ(\frac{n+2}{2})\, Γ(\frac{n+1}{2})} = \frac{n!}{Γ(n+1)} = 1.

Summarizing, we get
B_{01} = \sqrt{2(n−2)}\,\Big(\frac{n−2}{n}\Big)^{\frac{n+1}{2}}\, \frac{|\tilde{Σ}|^{\frac{n+1}{2}}}{|\hat{Σ}|^{\frac{1}{2}}}.
(e) It holds that
\frac{2}{n+1} \ln(B_{10}) = \ln|\tilde{Σ}| − \ln|\hat{Σ}| + \mathrm{rest}(n)
with
\mathrm{rest}(n) = \frac{1}{2}\ln\Big(\frac{n−2}{n}\Big) + \frac{2}{n+1}\ln(n−2) + \frac{1}{n+1}\big(\ln(2) + \ln|\hat{Σ}|\big) = o_P(1),
since the Bayes estimate is bounded in probability.

10.8 Solutions for Chapter 9


1. Stroke data:
(a) Treatment: X1 ∼ Bin(n1 , p1 ); Control: X2 ∼ Bin(n2 , p2 ) independent of
each other.
(b) Jeffreys priors: p_1 ∼ Beta(0.5, 0.5) and p_2 ∼ Beta(0.5, 0.5); see Example 3.20 on page 56.
(c) Least favourable priors: p_1 ∼ Beta(\sqrt{n_1}/2, \sqrt{n_1}/2) and p_2 ∼ Beta(\sqrt{n_2}/2, \sqrt{n_2}/2); see Example 4.20 on page 109.
(d) In Example 2.11 on page 16, it is shown that for prior p ∼ Beta(α, β),
the posterior is p|x ∼ Beta(α + x, β + n − x).
(e) Sampling of odds ratio:
oddsratio<-function(N,a1,b1,a2,b2){
theta<-rep(0,N); p1<-rbeta(N,a1,b1); p2<-rbeta(N,a2,b2);
theta<-p1/p2; return(theta)}
(f) Applying (8.4), we reject H0 : θ ≤ 1 under both priors.
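A usage sketch of the sampler in (e) (the counts below are placeholders, not the actual stroke data): with the Jeffreys priors of (b) and the posterior form of (d), the posterior probability of H0: θ ≤ 1 can be approximated by the fraction of sampled ratios not exceeding 1.
# Placeholder counts -- replace by the stroke data of the exercise.
n1 <- 100; x1 <- 20   # treatment
n2 <- 100; x2 <- 12   # control
set.seed(1)
theta <- oddsratio(10000, 0.5 + x1, 0.5 + n1 - x1, 0.5 + x2, 0.5 + n2 - x2)
mean(theta <= 1)      # Monte Carlo estimate of P(H0 | data); reject H0 if this is small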
2. Independent MC:
(a) For α = 1 and β = 20, the integral is
μ = \int π(θ)\, \ell(θ|x)\, dθ = \binom{n}{x} B(α, β)^{-1} \int_0^1 θ^{α+x−1} (1 − θ)^{β+n−x−1}\, dθ.

(b) Method 1: Independent MC sampled from prior; Method 2: Independent


MC sampled from U[0, 1]; Method 3: Deterministic method implemented
in R.
(c) Method 2 applies factorization (9.5) with p(θ|x) = 1 (a short R sketch follows this item).
i. Draw θ^{(1)}, . . . , θ^{(N)} from U[0, 1].
ii. Approximate μ by
\hat{μ}(x) = \binom{n}{x} B(α, β)^{-1} \frac{1}{N} \sum_{j=1}^{N} (θ^{(j)})^{α+x−1} (1 − θ^{(j)})^{β+n−x−1}.
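A short R sketch of Method 2, with the deterministic Method 3 for comparison (the data values n and x below are placeholders; α = 1 and β = 20 as in (a)):
n <- 10; x <- 3; a <- 1; b <- 20; N <- 1e5    # placeholder data, prior parameters, MC size
set.seed(1)
theta <- runif(N)                              # Method 2: draws from U[0,1]
g <- function(t) t^(a + x - 1) * (1 - t)^(b + n - x - 1)
mu.MC  <- choose(n, x)/beta(a, b) * mean(g(theta))
mu.det <- choose(n, x)/beta(a, b) * integrate(g, 0, 1)$value   # Method 3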
(d) Results are different since the deterministic integral is approximated by
a random number in Methods 1 and 2.
(e) In Method 1 the generating distribution has variance 0.0021 and in Method 2 the variance is 1/12 ≈ 0.083. Observe that a prior with variance 0.0021 is not a good choice for a Bayes model.
3. R code:
(a) We have an i.i.d. sample from C(θ, 1) with prior θ ∼ N(2, 1).
(b) Rejection algorithm with trial C(3, 1) and constant M = 5.
(c) 10000 sampled from the posterior.
(d) Given current state θ^{(j)} (a minimal R sketch of such a sampler follows step ii below):
i. Draw θ from C(3, 1) and compute
r(θ, x) = \frac{\ell(θ|x)}{π(θ)}, \qquad \ell(θ|x) = \prod_{i=1}^{n} f_C(x_i|θ, 1), \qquad π(θ) = f_C(θ|3, 1),
where f_C(·|m, λ) denotes the density of the Cauchy distribution with location parameter m and scale parameter λ.
ii. Draw u independently from U[0, 1]. Then θ^{(j+1)} = θ if u ≤ r(θ, x); otherwise a new trial is drawn.
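A minimal R sketch of such a rejection sampler (assuming the data vector x from (a) is available; here the N(2, 1) prior and the constant M = 5 enter the acceptance ratio in the usual way, which may differ in details from the code given in the exercise):
rejection.cauchy <- function(x, N, M = 5) {
  target <- function(th) dnorm(th, 2, 1) * prod(dcauchy(x, th, 1))   # prior times likelihood
  out <- numeric(N)
  for (j in 1:N) {
    repeat {
      th <- rcauchy(1, 3, 1)                                         # draw from the trial C(3,1)
      if (runif(1) <= target(th)/(M * dcauchy(th, 3, 1))) break      # accept or reject
    }
    out[j] <- th
  }
  out
}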

4. MCMC for logistic regression:


(a) The prior is θ ∼ C(1, 1), such that π(θ) ∝ \frac{1}{1+(θ−1)^2}. The likelihood is
\ell(θ|x, y) ∝ \prod_{i=1}^{n} \frac{\exp(−θx_i)^{y_i}\, (1 − \exp(−θx_i))^{n−y_i}}{1 + \exp(−θx_i)},
and the trial is T(θ^{(j)}, θ) = \frac{1}{2a} for θ^{(j)} − a ≤ θ ≤ θ^{(j)} + a.
(b) R code:
MCMC.walk<-function(a,seed,N)
{
  # x (covariates) and s (the sufficient statistic of the data) are assumed
  # to be available in the workspace.
  rand<-rep(NA,N); rand[1]<-seed;
  for(i in 2:N)
  { rand.new<-rand[i-1]+a*runif(1,-1,1);              # random-walk proposal
    lik.new<-exp(rand.new*s)/prod(1+exp(rand.new*x));
    p.new<-lik.new/(1+(rand.new-1)^2);                 # likelihood times Cauchy(1,1) prior
    lik.old<-exp(rand[i-1]*s)/prod(1+exp(rand[i-1]*x));
    p.old<-lik.old/(1+(rand[i-1]-1)^2);
    r<-min(1,p.new/p.old);                             # Metropolis acceptance ratio
    if(runif(1)<r){rand[i]<-rand.new}else{rand[i]<-rand[i-1]}
  };
  return(rand)
}
(c) Calculate M<-MCMC.walk(a,seed,N); acf(M) for a = 0.5, 1, 2. Best choice a_0 = 1. Calculate M1<-MCMC.walk(1,0,N); plot.ts(M1) and M2<-MCMC.walk(1,10,N); lines(1:N,M2,col=2). Burn-in time k = 40.
(d) Carry out MCMC.walk(a_0,0,N) to obtain θ^{(1)}, . . . , θ^{(N)}. Calculate \hat{μ} = \frac{1}{N−k} \#\{θ^{(j)} > 0.2,\ j > k\}. We obtain \hat{μ} = 0.0064.
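The steps in (c) and (d) can be collected into a short script (a sketch; N, the data objects x and s, and the burn-in k = 40 from (c) are assumed to be set):
M <- MCMC.walk(1, 0, N)        # run the chain with the chosen a0 = 1
k <- 40                        # discard the burn-in
mean(M[(k + 1):N] > 0.2)       # estimate of the posterior probability in (d)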
5. Gibbs sampling from N2 (μ, Σ):
(a) Main steps: For a current state (x^{(t)}, y^{(t)}),
i. Generate X_{t+1} | y_t ∼ N(ρ y_t, 1 − ρ^2).
ii. Generate Y_{t+1} | x_{t+1} ∼ N(ρ x_{t+1}, 1 − ρ^2).
(b) gibbs.norm<-function(rho,N)
{
  x<-rep(0,N); y<-rep(0,N)                       # start the chain in (0, 0)
  for(i in 2:N){
    x[i]<-rnorm(1,rho*y[i-1],sqrt(1-rho^2));     # X_{t+1} | y_t, sd = sqrt(1 - rho^2)
    y[i]<-rnorm(1,rho*x[i],sqrt(1-rho^2))        # Y_{t+1} | x_{t+1}
  }
  return(data.frame(x,y))
}
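A quick usage sketch of the sampler (values are illustrative):
out <- gibbs.norm(0.9, 5000)
acf(out$x)                     # autocorrelation within the x-chain
plot(out$x, out$y)             # scatter of the sampled pairs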
(c) After t iterations we get
\begin{pmatrix} X^{(t)} \\ Y^{(t)} \end{pmatrix} \sim N_2\left( \begin{pmatrix} ρ^{2t−1} X^{(0)} \\ ρ^{2t} Y^{(0)} \end{pmatrix}, \begin{pmatrix} 1 − ρ^{4t−2} & ρ − ρ^{4t−1} \\ ρ − ρ^{4t−1} & 1 − ρ^{4t} \end{pmatrix} \right).
The chain is not stationary, but the limit distribution is the right one, since for |ρ| < 1, \lim_{t→∞} ρ^t = 0.
6. ABC for Iris data:
(a) The data is an i.i.d. sample from a truncated bivariate normal distribution, TMVN(μ, Σ, a, b). This is N_2(μ, Σ) with
μ = \begin{pmatrix} μ_1 \\ μ_2 \end{pmatrix}, \qquad Σ = \begin{pmatrix} σ_1^2 & ρσ_1σ_2 \\ ρσ_1σ_2 & σ_2^2 \end{pmatrix},
truncated to the support [a_1, b_1] × [a_2, b_2]. Here we have [0, ∞) × [0, ∞). The parameter is θ = (μ_1, μ_2, σ_1, σ_2, ρ); m_1 is the average of the first variable and m_2 is the average of the second variable.
(b) The priors are σ_1^2 ∼ U[0.2, 2], σ_2^2 ∼ U[0.2, 2], ρ ∼ U[0, 0.9], μ_1 ∼ TN(0, ∞, 3, 1), and μ_2 ∼ TN(0, ∞, 2, 1), where TN(0, ∞, m, v^2) is the normal distribution N_1(m, v^2) truncated on [0, ∞).
(c) The algorithm samples θ^{(1)}, . . . , θ^{(N)} independently from a distribution which approximates the posterior. The posterior has no closed-form expression.
(d) It is an ABC algorithm (an R sketch is given after this list). Given current state θ^{(j)}:
i. Draw independently θ_1 from TN(0, ∞, 3, 1), θ_2 from TN(0, ∞, 2, 1), θ_3 from U[0.2, 2], θ_4 from U[0.2, 2], and θ_5 from U[0, 0.9].
ii. For i = 1, . . . , 50, generate new observations Z_{new,i} ∼ TMVN(μ, Σ, a, b), where
μ = \begin{pmatrix} θ_1 \\ θ_2 \end{pmatrix}, \qquad Σ = \begin{pmatrix} θ_3^2 & θ_5θ_3θ_4 \\ θ_5θ_3θ_4 & θ_4^2 \end{pmatrix}.
iii. Calculate \frac{1}{n} \sum_{i=1}^{n} Z_{new,i} = (m_{new,1}, m_{new,2}), and D = (m_1 − m_{new,1})^2 + (m_2 − m_{new,2})^2.
iv. For D < tol we set θ^{(j+1)} = θ; otherwise we go back to Step i.
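A minimal R sketch of this ABC rejection scheme (the observed means m1, m2 and the tolerance tol are placeholders to be supplied; the truncation is handled by simple rejection, which may differ from the code given in the exercise):
library(mvtnorm)
rtnorm0 <- function(m, v) { repeat { z <- rnorm(1, m, sqrt(v)); if (z > 0) return(z) } }  # N(m, v) truncated to [0, Inf)
abc.iris <- function(m1, m2, N, tol, nobs = 50) {
  draws <- matrix(NA, N, 5)
  for (j in 1:N) {
    repeat {
      theta <- c(rtnorm0(3, 1), rtnorm0(2, 1), runif(1, 0.2, 2), runif(1, 0.2, 2), runif(1, 0, 0.9))
      Sigma <- matrix(c(theta[3]^2, theta[5]*theta[3]*theta[4],
                        theta[5]*theta[3]*theta[4], theta[4]^2), 2, 2)
      Z <- matrix(NA, nobs, 2); i <- 1
      while (i <= nobs) {                        # draw from N2(mu, Sigma) truncated to [0, Inf)^2
        z <- rmvnorm(1, c(theta[1], theta[2]), Sigma)
        if (all(z > 0)) { Z[i, ] <- z; i <- i + 1 }
      }
      D <- (m1 - mean(Z[, 1]))^2 + (m2 - mean(Z[, 2]))^2
      if (D < tol) break                         # keep theta when the simulated summaries are close
    }
    draws[j, ] <- theta
  }
  draws
}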
Chapter 11

Appendix

Here we briefly list some of the most commonly used distributions in Bayesian inference. For details, we recommend other sources given in the bibliography.

11.1 Discrete Distributions


Consider a discrete random variable X with sample space X ⊆ Z. We denote
Poisson distribution Poi(λ), binomial distribution Bin(n, p), beta-binomial dis-
tribution BetaBin(n, α, β), negative binomial distribution NB(r, p), and geo-
metric distribution Geo(p).

Notation | P(X = k) | EX | Var(X)
Poi(λ) | \frac{λ^k}{k!} \exp(−λ) | λ | λ
Bin(n, p) | \binom{n}{k} p^k (1 − p)^{n−k} | np | np(1 − p)
BetaBin(n, α, β) | \binom{n}{k} \frac{B(α+k, β+n−k)}{B(α, β)} | \frac{nα}{α+β} | \frac{nαβ}{(α+β)^2}\,\frac{α+β+n}{α+β+1}
NB(r, p) | \binom{k+r−1}{k} p^r (1 − p)^k | \frac{r(1−p)}{p} | \frac{r(1−p)}{p^2}
Geo(p) | p(1 − p)^k | \frac{1−p}{p} | \frac{1−p}{p^2}




Notation | R package | R function
Poi(λ) | | dpois(k,lambda)
Bin(n, p) | | dbinom(k,n,p)
BetaBin(n, α, β) | library(VGAM) | dbetabinom.ab(k,n,alpha,beta)
NB(r, p) | | dnbinom(k,r,p)
Geo(p) | | dgeom(k,p)

11.2 Continuous Distributions


Let X be a real-valued random variable with sample space X ⊆ R. We de-
note normal distribution N(μ, σ 2 ), t-distribution t1 (f, μ, σ 2 ), F-distribution
Ff1 ,f2 , exponential distribution Exp (λ), Cauchy distribution C(m, σ), Laplace
distribution La(μ, σ), beta distribution Beta(α, β), gamma distribution
Gamma(α, β), and inverse-gamma distribution InvGamma(α, β).

Notation | f(x|θ) ∝ | EX | Var(X)
N(μ, σ^2) | \exp\big(−\frac{1}{2}\big(\frac{x−μ}{σ}\big)^2\big) | μ | σ^2
t_1(f, μ, σ^2) | \big(1 + \frac{1}{f}\big(\frac{x−μ}{σ}\big)^2\big)^{−\frac{f+1}{2}} | μ | \frac{f}{f−2} σ^2, f > 2
F_{f_1,f_2} | x^{\frac{f_1}{2}−1}\big(1 + \frac{f_1}{f_2} x\big)^{−\frac{f_1+f_2}{2}} | \frac{f_2}{f_2−2}, f_2 > 2 | \frac{2 f_2^2 (f_1+f_2−2)}{f_1 (f_2−2)^2 (f_2−4)}, f_2 > 4
Exp(λ) | λ \exp(−λx) | \frac{1}{λ} | \frac{1}{λ^2}
C(m, σ) | \big(1 + \big(\frac{x−m}{σ}\big)^2\big)^{−1} | – | –
La(μ, σ) | \exp\big(−\big|\frac{x−μ}{σ}\big|\big) | μ | 2σ^2
Beta(α, β) | x^{α−1} (1 − x)^{β−1} | \frac{α}{α+β} | \frac{αβ}{(α+β)^2 (α+β+1)}
Gamma(α, β) | x^{α−1} \exp(−xβ) | \frac{α}{β} | \frac{α}{β^2}
InvGamma(α, β) | x^{−α−1} \exp\big(−\frac{β}{x}\big) | \frac{β}{α−1}, α > 1 | \frac{β^2}{(α−1)^2 (α−2)}, α > 2

Notation | R package | R function
N(μ, σ^2) | | dnorm(x,mu,sigma)
t_1(f, μ, σ^2) | library(metRology) | dt.scaled(x,f,mu,sigma)
F_{f_1,f_2} | | df(x,f1,f2)
Exp(λ) | | dexp(x,lambda)
C(m, σ) | | dcauchy(x,m,sigma)
La(μ, σ) | library(ExtDist) | dLaplace(x,mu,sigma)
Beta(α, β) | | dbeta(x,alpha,beta)
Gamma(α, β) | | dgamma(x,alpha,beta)
InvGamma(α, β) | library(invgamma) | dinvgamma(x,alpha,beta)

11.3 Multivariate Distributions


Let X be a random vector with sample space X ⊆ Rp . We denote multivariate
normal distribution Np (γ, Σ) and multivariate t-distribution tp (f, μ, Σ).

Further we list distributions of (X, λ) where X is a vector valued random


variable with sample space X ⊆ Rp and λ is a positive random scalar. We
denote normal-gamma distribution NGam(α, β, μ, Σ−1 ) and normal-inverse-
gamma distribution NIG(α, β, μ, Σ).

We also refer to the toolbox given in Chapter 6: Lemma 6.1, Lemma 6.2,
Theorem 6.1, and Lemma 6.5.

Notation | f(X|θ) ∝ | EX | Cov(X)
N_p(μ, Σ) | \exp\big(−\frac{1}{2} \|X − μ\|_Σ^2\big) | μ | Σ
t_p(f, μ, Σ) | \big(1 + \frac{1}{f} \|X − μ\|_Σ^2\big)^{−\frac{f+p}{2}} | μ | \frac{f}{f−2} Σ, f > 2
NGam(α, β, μ, P), P = Σ^{−1} | λ^{\frac{p+α−2}{2}} \exp\big(−\frac{λ}{2}(β + \|X − μ\|_Σ^2)\big) | EX = μ, Eλ = \frac{α}{β} | Cov(X) = \frac{β}{α−2} Σ, Var(λ) = \frac{2α}{β^2}, Cov(X, λ) = 0, α > 2
NIG(α, β, μ, Σ) | λ^{−\frac{p+α+2}{2}} \exp\big(−\frac{1}{2λ}(β + \|X − μ\|_Σ^2)\big) | EX = μ, Eλ = \frac{β}{α−2} | Cov(X) = \frac{β}{α−2} Σ, Var(λ) = \frac{2β^2}{(α−2)^2(α−1)}, Cov(X, λ) = 0, α > 2
where \|X − μ\|_Σ^2 = (X − μ)^T Σ^{−1} (X − μ).

Notation | R package | R function
N_p(μ, Σ) | library(mvtnorm) | dmvnorm(X,mu,Sigma)
t_p(f, μ, Σ) | library(mvtnorm) | dmvt(X,df=f,mu,Sigma)
NGam(α, β, μ, P) | library(lestat) | mnormalgamma(mu,P,2*alpha,2*beta)
NIG(α, β, μ, Σ) | library(PIGShift) | dmvnorminvgamma(xx,2*alpha,2*beta,mu,Sigma)

11.4 Matrix-Variate Distributions


Let Z be a random matrix, with sample space Z ⊆ Rm×k . We de-
note matrix-variate normal distribution MNm,k (M, U, V) and matrix-variate
t-distribution tm,k (ν, M, U, V).

Notation | f(Z|θ) ∝ | EZ | Cov(vec(Z))
MN_{m,k}(M, U, V) | \exp\big(−\frac{1}{2} \mathrm{tr}(Q)\big) | M | V ⊗ U
t_{m,k}(ν, M, U, V) | |I_k + Q|^{−\frac{ν+m+k−1}{2}} | M | \frac{1}{ν−2}(V ⊗ U), ν > 2
where Q(Z) = V^{−1}(Z − M)^T U^{−1}(Z − M).

Let W be a positive definite random matrix with sample space W ⊂ R^{k×k}. We denote Wishart distribution W_k(ν, V) and inverse-Wishart distribution IW_k(ν, V).
Notation | f(W|θ) ∝ | EW
W_k(ν, V) | |W|^{\frac{ν−k−1}{2}} \exp\big(−\frac{1}{2} \mathrm{tr}(W V^{−1})\big) | νV
IW_k(ν, V) | |W|^{−\frac{ν+k+1}{2}} \exp\big(−\frac{1}{2} \mathrm{tr}(W^{−1} V)\big) | \frac{1}{ν−k−1} V

Further
Notation | Cov(vec(W))
W_k(ν, V) | ν(I_{k^2} + K_{k,k})(V ⊗ V)
IW_k(ν, V) | \frac{2}{(ν+1)v^2(v−2)} \mathrm{vec}(V)\mathrm{vec}(V)^T + \frac{2}{(ν+1)v(v−2)} (I_{k^2} + K_{k,k})(V ⊗ V), ν > 2
where K_{k,k} is the commutation matrix of A ∈ R^{k×k}, defined by K_{k,k}\mathrm{vec}(A) = \mathrm{vec}(A^T) and given as K_{k,k} = \sum_{j=1}^{k} \sum_{i=1}^{k} H_{ij} ⊗ H_{ij}^T, where H_{ij} is a matrix with 1 at place (i, j) and zeros elsewhere. We refer to Kollo and von Rosen (2005) for the moments of W_k(ν, V) and IW_k(ν, V).

We also give the joint distribution of (Z, W), i.e., the normal-inverse-Wishart distribution, NIW(ν, M, U, V).
Notation | f(Z, W|θ) ∝ | E(Z, W)
NIW(ν, M, U, V), Q = V^{−1}(Z − M)^T U^{−1}(Z − M) | |W|^{−\frac{ν+k+m+1}{2}} \exp\big(−\frac{1}{2} \mathrm{tr}\,[W^{−1} V (I_k + Q)]\big) | EZ = M, EW = \frac{1}{ν−k−1} V
Finally we list the corresponding R functions.
Notation | R package | R function
MN_{m,k}(M, U, V) | library(matrixNormal) | dmatnorm(Z,M,U,V)
t_{m,k}(ν, M, U, V) | library(mniw) | dMT(Z,nu,M,U,V)
W_k(ν, V) | library(mniw) | dwish(W,nu,V)
IW_k(ν, V) | library(mniw) | diwish(W,nu,V)
NIW(ν, M, U, V) | library(mniw) | rmniw(n,M,U,V,nu) (random numbers)
Bibliography

J. Albert. Bayesian Computation with R. Springer, 2005.


H. Augier, L. Benkoël, J. Brisse, A. Chamlian, and W. K. Park. Necroscopic
localization of mercury-selenium interaction products in liver, kidney, lung
and brain of Mediterranean striped dolphins (Stenella coeruleoalba) by silver
enhancement kit. Cell. and Molec. Biology, 39:765–772, 1993.
M.A. Beaumont. Approximative Bayesian Computation. Annual Review of
Statistics and Its Application, 6:379–403, 2019.
D.R. Bellhouse. The reverend Thomas Bayes, FRS: A biography to celebrate
the tercentenary of his birth. Statistical Science, 19(1):3–32, 2004.
J. O. Berger and J. M. Bernardo. On the development of reference priors. Bayesian Statistics, 4:35–60, 1992a.
J. O. Berger and J. M. Bernardo. Ordered group reference priors with appli-
cation to the multinomial problem. Biometrika, 79(1):25–37, 1992b.
J.O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer,
New York, 1980.
J.O. Berger, J.M. Bernardo, and D. Sun. The formal definition of reference priors. The Annals of Statistics, 37(2):905–938, 2009.
J. M. Bernardo. Reference posterior distributions for Bayesian inference. Jour-
nal of the Royal Statistical Society, Ser. B, 41(2):113–147, 1979.
J.M. Bernardo. Reference analysis. In D.K. Dey and C.R. Rao, editors,
Handbook of Statistics 25, pages 17–90. Elsevier, 2005.
C.M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
G.E.P. Box and G.C. Tiao. Bayesian Inference in Statistical Analysis.
Addison-Wesley, 1973.
L.D. Broemeling. Bayesian Methods for Repeated Measures. CRC Press, 2016.
Ming-Hui Chen, Qi-Man Shao, and J.G. Ibrahim. Monte Carlo Methods in
Bayesian Computation. Springer, 2002.
B.S. Clarke and A.R. Barron. Jeffreys’ prior is asymptotically least favourable
under entropy risk. Journal of Statistical Planning and Inference, 41:37–60,
1994.
M.J. Crowder and D.J. Hand. Analysis of Repeated Measures. Chapman &
Hall, 1990.

W.W. Daniel and C.L. Cross. Biostatistics: A Foundation for Analysis in the
Health Sciences, 10th ed. Wiley, 2013.
E. Demidenko. Mixed models: Theory and Applications with R. Wiley, 2013.
P. Diaconis and D. Freedman. On the consistency of Bayes estimates. The
Annals of Statistics, 14(1):1–26, 1986.
N. R. Draper and H. Smith. Applied Regression Analysis. Wiley, 1966.
J.A. Dupuis. Bayesian estimation of movement and survival probabilities from
capture-recapture data. Biometrika, 82(4):761–772, 1995.
G.H. Givens and J.L. Hoeting. Computational Statistics. Wiley, 2005.
A.K. Gupta and D.K. Nagar. Matrix Variate Distributions. CRC Press, 2000.
T.J. Hastie, R.J. Tibshirani, and M. Wainright. Statistical Learning with
Sparsity: The LASSO and Generalizations. CRC Press, 2015.
W.K. Hastings. Monte Carlo sampling methods using Markov chains and their
applications. Biometrika, 57:97–109, 1970.
A.E. Hoerl and R.W. Kennard. Ridge regression: Applications to nonorthog-
onal problems. Technometrics, 12:55–67, 1970.
W. James and C. Stein. Estimation with quadratic loss. In Proc. 4th Berkeley
Symp. Math. Statist. Probab., volume 1, pages 361–380, Berkeley, 1960.
Univ. of California Press.
H. Jeffreys. An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London, Series A, 186:453–461, 1946.
R. E. Kass and P.W. Vos. Geometrical Foundations of Asymptotic Inference.
Wiley, 1997.
R. E. Kass and L. Wasserman. The selection of prior distributions by formal rules. Journal of the American Statistical Association, 91(435):1343–1370, 1996.
R.E. Kass and A.E. Raftery. Bayes factors. Journal of the American Statistical Association, 90(430):773–795, 1995.
K-R. Koch. Introduction to Bayesian Statistics, 2nd ed. Springer, 2007.
T. Kollo and D. von Rosen. Advanced Multivariate Statistics with Matrices.
Springer, 2005.
Se Yoon Lee. Gibbs sampler and coordinate ascent variational inference: A
set theoretical review. Communications in Statistics-Theory and Methods,
51(6):1549–1568, 2022.
H. Liero and S. Zwanzig. Introduction to the Theory of Statistical Inference.
CRC Press, 2011.
F. Liese and K.-J. Miescke. Statistical Decision Theory. Springer, 2008.
Pi-Erh Lin. Some characterization of the multivariate t distribution. Journal
of Multivariate Analysis, 2:339–344, 1972.
B. W. Lindgren. Statistical Theory. Chapman & Hall, 1962.
D.V. Lindley. On a measure of the information provided by an experiment.
The Annals of Mathematical Statistics, 27(4):986–1005, 1956.
D.V. Lindley and A.F.M. Smith. Bayes estimates for the linear model. Journal
of the Royal Statistical Society, Ser. B, 34:1–41, 1972.
Jun S. Liu. Monte Carlo Strategies in Scientific Computing. Springer Series in Statistics. Springer, 2001. ISBN 0-387-95230-6.
J.R. Magnus and H. Neudecker. The elimination matrix: Some lemmas and
applications. SIAM Journal on Algebraic Discrete Methods, 1(4):422–449,
1980.
J.R. Magnus and H. Neudecker. Matrix Differential Calculus with Applications
in Statistics and Econometrics. Wiley, 1999.
K. V. Mardia, J. T. Kent, and J. M. Bibby. Multivariate Analysis. Academic
Press, 1979.
A.M. Mathai and H.J. Haubold. Special Functions for Applied Scientists.
Springer, 2008.
N. Metropolis, A. Rosenbluth, M.N. Rosenbluth, A.H. Teller, and E. Teller.
Equations of state calculations by fast computing machines. Journal of
Chemical Physics, 21(6):1087–1091, 1953.
T.Tin. Nguyen, Hien D. Nguyen, F. Chamroukhi, and G.L. McLachlan.
Bayesian estimation of movement and survival probabilities from capture-
recapture data. Cogent Mathematics & Statistics, 7, 2020.
N. Polson and L. Wasserman. Prior distributions for the bivariate binomial.
Biometrika, 77(4):901–904, 1990.
J. Press and J. M. Tanur. The Subjectivity of Scientists and the Bayesian Approach. Wiley, 2001.
H. Raiffa and R. Schlaifer. Applied Statistical Decision Theory. Harvard
University, Boston, 1961.
C. P. Robert. The Bayesian Choice. Springer, 2001.
C.P. Robert and G. Casella. Introducing Monte Carlo Methods with R.
Springer, 2010.
L. Schwartz. On Bayes procedures. Z. Wahrscheinlichkeitstheorie, 4:10–26,
1965.
G. Schwarz. Estimating the dimension of a model. The Annals of Statistics,
6(2):461–464, 1978.
S. Searle, G. Casella, and C.E. McCulloch. Variance Components. Wiley,
2006.
S.R. Searle. Linear Models. Wiley, 1971.
S.R. Searle and M.J. Gruber. Linear Models, 2nd ed. Wiley, 2017.
G.A.F. Seber. A Matrix Handbook for Statisticians. Wiley, 2008.
G.W. Snedecor and W.G. Cochran. Statistical Methods, 8th ed. Iowa State
University Press, 1989.
D.J. Spiegelhalter, N.G. Best, P.C. Bradley, and A. Van der Linde. Bayesian
measures of model complexity and fit. Journal of the Royal Statistical So-
ciety Ser. B, 64(4):583–639, 2002.
S. M. Stigler. Thomas Bayes’s Bayesian inference. Journal of the Royal Sta-
tistical Society. Ser. A, 145(2):250–258, 1982.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Ser. B, 58:267–288, 1996.
L. Tierney, R.E. Kass, and J.B. Kadane. Fully exponential Laplace approximations to expectations and variances of nonpositive functions. Journal of the American Statistical Association, 84(407):710–716, 1989.
A. W. van der Vaart. Asymptotic Statistics. Cambridge University Press,
1998.
A. Zellner. An Introduction to Bayesian Inference in Econometrics. Wiley,
1971.
Hui Zou and T. Hastie. Regularization and variable selection via the elastic
net. Journal of the Royal Statistical Society, Ser. B, 67(2):301–320, 2005.
S. Zwanzig and B. Mahjani. Computer Intensive Methods in Statistics. CRC
Press, 2020.
Index

Autoregressive model, 208 Dirichlet, 27, 39, 41, 62


exponential, 125, 324
Bayes decision rule, 84 exponential family, 40
Bayes factor, 222 Fisher’s z, 290
Bayes Theorem, 14 gamma, 18, 75, 125, 187, 212,
Bayesian linear models 288, 324
mixed, 166 geometric, 75, 288, 323
estimation, 162 Haldane, 289
univariate, 131 inverse-gamma, 19, 95, 145, 200,
posterior, 134, 135 324
Brute-force, 241 inverse-Wishart, 173, 327
Laplace, 57, 190, 324
Comparison of models, 228 log-normal, 75, 287
Completing the squares, 137 matrix-variate t, 174, 178, 215,
Credible region, 199 326
Cumulant generating function, 43 matrix-variate normal, 170, 215,
326
Decision, 78
multinomial, 6, 39, 41, 61, 73, 76,
admissible, 81
290
inadmissible, 81
multivariate t, 148, 325
admissible, 99, 100, 102, 103, 105
multivariate normal, 134, 325
Bayes, 101–103, 108
conditional, marginal, 132
minimax, 98–100, 105, 108
joint, 132
randomized, 97, 101
joint posterior, 146
Decision rule, 78
negative binomial, 208, 323
test, 217
noncentral chi squared, 83
Decision space, 78
normal, 6, 42, 76, 77, 82, 96, 291,
DIC, 233
324
Distribution
normal-gamma, 181, 302, 325
F, 289, 324
normal-inverse-gamma, 145, 147,
t, 186, 324
325
Bernoulli, 8
normal-inverse-Wishart, 173, 174,
beta, 16, 184, 324
327
beta-binomial, 35, 213, 268, 308,
parameter generating, 11
323
Pareto, 75, 287
binomial, 86, 102, 109, 114, 323
Poisson, 206, 323
Cauchy, 21, 75, 88, 287, 324
Poisson weighted mixture, 83
data generating, 11, 183, 269

Wishart, 172, 327 lion’s appetite, 7, 9, 20, 31, 51,
80, 81, 84, 86, 98, 106, 109,
Entropy 203
Shannon, 51, 64, 77, 291 mixture distribution, 48, 202
Estimation multinomial distribution, 61, 73
regularized, 191 no consistency, 116, 118
Estimator normal, 87
Bayes, 115 normal distribution, 10, 12, 16,
elastic net, 191 18, 19, 30, 51, 96, 183, 185,
lasso, 190 187, 199, 200, 219, 246
least-squares normal i.i.d. sample, 94
multivariate, 172 Parkinson, 143
Maximum a Posteriori Estimator pendulum, 6
(MAP), 186, 243 random-walk Metropolis, 262
Maximum Likelihood Estimator rejection algorithm, 256
(MLE), 186 sampling importance resampling,
minimax, 98 253
regularized, 189 sequential data, 23
restricted, 189 sex ratio at birth, 218, 226, 227
ridge estimator, 190 side effects, 128
Example smokers, 144
ABC, 270, 272 systematic-scan Gibbs Sampler,
Bahadur, 118 266
big data, 23 Variational Inference, 276
billiard table, 12, 16 weather, 37, 48, 187, 194, 202
binomial distribution, 51, 52, 56, Exponential family
64, 66, 86, 91, 102, 109, 218, conjugate prior, 43, 47
225 definition, 40
bivariate binomial, 70 natural parameter, 40
capture–recapture, 25, 32 sample, 42
Cauchy distribution, 21
Cauchy i.i.d. sample, 88 Fisher information, 54
CAVI algorithm, 279
corn plants, 127, 142, 159, 172, Gamma distribution
179 conjugate prior, 44, 75
Corona, 7, 45 Gamma function, 149
flowers, 6, 39
gamma distribution, 44 Hardy–Weinberg Model, 290
hip operation, 160, 169, 198 Hellinger distance, 95, 124
importance sampling, 250 Hellinger transform, 96, 120, 124
independent MC, 246 Hyperparameter, 25, 32
James–Stein estimator, 82
Information
Laplace approximation, 244
expected information, 63
life length, 130, 139, 153, 158,
mutual information, 64
197, 204, 220
Information criterion
Akaike ( AIC), 236 Markov Chain Monte Carlo
Bayesian (BIC), 232 annealing, 261
Deviance (DIC), 233 balance inequality, 260
DIC general MCMC, 258
normal linear model, 234 Gibbs Sampling, 27, 263
Schwarz, 232 MCMC, 257
Instrumental distribution, 249 Metropolis algorithm, 259
Metropolis–Hastings, 259
Jacobian, 52, 303 Metropolis–Hastings ratio, 259
Jeffreys–Lindley paradox, 230 proposal distribution, 261
random walk, 262
Kullback–Leibler divergence, 53, 76, thinning, 261
95, 120, 125, 275, 276, 291, MCMC
296 logistic regression, 320
Minimaxity, 98
Laplace approximation, 242
Model
Likelihood
Bayes model, 11
likelihood function, 8
Hardy–Weinberg, 76
matrix-variate normal, 171
hierarchical, 25
maximum likelihood principle, 9
statistical model, 5, 11
multivariate normal, 137, 147,
Monte Carlo
154
Importance Sampling (IS), 248
Linear mixed model
importance function, 249
marginal model, 161
importance weights, 249
Linear model, 126
instrumental, 249
orthogonal design, 139
trial, 249
Bayesian, 131
independent MC, 245
mixed, 159, 197
multivariate, 170 Normal distribution
univariate, 127 Gibbs sampler, 282
Location model, 56 Jeffreys prior, 60
Location scale model, 104
Location-scale model, 60 Odds ratio, 52
Logistic regression, 45
Loss Parameter
0–1 loss, 93 model indicator, 221
L1 , 79 nuisance, 70
L2 , 79 of interest, 70
absolute error, 89 Parameter space, 5, 11, 173
contrast condition, 114 Partitioned matrix, 133
Hellinger, 95 inverse, 133
Kullback–Leibler, 95 Pinskers’ inequality, 125
posterior expected, 84, 115 Posterior, 136
randomized decision, 97 precision parameter, 139
weighted quadratic, 85 joint, 166, 178
Loss function, 79 marginal, 178
matrix-variate t, 178 right Haar measure, 61
normal, 143 subjective, 31
odds ratio, 222 uniform distribution, 68
precision parameter, 135 Prior distribution, 11
robustness, 119 Probability function, 7
strongly consistent, 114
Posterior distribution, 14 Reference Analysis, 63
Prediction Risk
Bayes predictor, 92, 206 Bayes, 84
prediction error, 91 integrated, 83
predictive distribution, 92, 206 frequentist, 80, 97
quadratic regression, 211
Principle Sample
Bayesian, 14, 113, 183, 199, 218, i.i.d., 12
222, 240 Sample space, 5
likelihood, 8 Scale model, 58
Principles Schur complement, 133
modelling, 29 Schwartz’ Theorem, 121
Prior Score function, 54
Berger–Bernardo method, 63, 72 Sherman-Morrison-Woodbury matrix
conjugate, 38, 131, 165 identity, 134
exponential family, 43 Spherical structure, 138
gamma distribution, 212 Strong consistency, 114
hyperparameter, 32, 34
Target distribution, 249
improper, 50
Test
invariant, 55
p-value, 104
inverse-gamma, 145
randomized, 97
Jeffreys, 54
Trial distribution, 249, 255
joint, 177
Laplace, 50 Variational inference
least favourable, 105 CAVI, 279
left Haar measure, 61 evidence lower bound (ELBO),
NIG, 147 276
non-informative, 55, 64, 153 mean-field, 275
objective, 31 variational density, 275
odds ratio, 222 variational factor, 278
reference, 63, 67 variational family, 275
