Computational Statistics With Matlab
Mark Steyvers
3 Graphical Models
  3.1 A Short Review of Probability Theory
  3.2 The Burglar Alarm Example
    3.2.1 Conditional probability tables
    3.2.2 Explaining away
    3.2.3 Joint distributions and independence relationships
  3.3 Graphical Model Notation
    3.3.1 Example: Consensus Modeling with Gaussian variables
  4.2.3 Example: Posterior Inference for the consensus model with normally distributed variables
  4.2.4 Example: Posterior Inference for the consensus model with contaminants
Exercises
This course book contains a number of exercises in which you are asked to run Matlab
code, write new code, and produce graphical illustrations and answers to questions.
The exercises marked with ** are optional exercises that can be skipped when time is limited.
Matlab documentation
It will probably happen many times that you will need to find the name of a Matlab function
or a description of the input and output variables for a given Matlab function. It is strongly
recommended to always have the Matlab documentation running in a separate window
for quick consultation. You can access the Matlab documentation by typing doc in the
command window. For specific help on a given Matlab function, for example the function
fprintf, you can type doc fprintf to get a help screen in the Matlab documentation
window or help fprintf to get a description in the Matlab command window.
Chapter 1

Sampling from Random Variables
Probabilistic models proposed by researchers are often too complicated for analytic ap-
proaches. Increasingly, researchers rely on computational, numerical-based methods when
dealing with complex probabilistic models. By using a computational approach, the re-
searcher is freed from making unrealistic assumptions required for some analytic techniques
(e.g., normality and independence).
The key to most approximation approaches is the ability to sample from distributions.
Sampling is needed to predict how a particular model will behave under some particular
set of circumstances, and to find appropriate values for the latent variables (“parameters”)
when applying models to experimental data. Most computational sampling approaches turn
the problem of sampling from complex distributions into subproblems involving simple sam-
pling distributions. In this chapter, we will illustrate two sampling approaches: the inverse
transformation method and rejection sampling. These approaches are appropriate mostly for
the univariate case where we are dealing with single-valued outcomes. In the next chapter,
we discuss Markov chain Monte Carlo approaches that operate efficiently with multivariate
distributions.
Table 1.1: Examples of Matlab functions for evaluating probability density, cumulative den-
sity and drawing random numbers
Figure 1.1: Illustration of the Normal(µ, σ) distribution where µ = 100 and σ = 15.
population. The code shows how to display the probability density and the cumulative
density. It also shows how to draw random values from this distribution and how to visualize
the distribution of these random samples using the hist function. The code produces the
output as shown in Figure 1.1. Similarly, Figure 1.2 visualizes the discrete Binomial(N, θ)
distribution where N = 10 and θ = 0.7. The binomial arises in situations
where a researcher counts the number of successes out of a given number of trials. For
example, the Binomial(10, 0.7) distribution represents a situation where we have 10 total
trials and the probability of success at each trial, θ, equals 0.7.
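As a concrete illustration of the functions mentioned above, here is a minimal sketch (not the book's own Listing 1.1) of how the distribution could be evaluated and sampled; the grid of x values and the sample size are arbitrary choices, and the plotting commands correspond to the fragment of Listing 1.1 reproduced further below.

mu = 100; sigma = 15;                  % parameters of the normal distribution
x  = 60:0.1:140;                       % grid of values at which to evaluate pdf and cdf
p  = normpdf( x , mu , sigma );        % probability density function
c  = normcdf( x , mu , sigma );        % cumulative density function
y  = normrnd( mu , sigma , 1 , 2000 ); % draw random samples from the distribution
figure;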
Exercises
1. Adapt the Matlab program in Listing 1.1 to illustrate the Beta(α, β) distribution where
α = 2 and β = 3. Similarly, show the Exponential(λ) distribution where λ = 2. As
discussed in the note at the beginning of this book, you can use the publish option to
organize the source code as well as the resulting figures into a single document.
2. Adapt the matlab program above to illustrate the Binomial(N, θ) distribution where
subplot( 1,3,1 );
plot( x , p , 'k-' );
xlabel( 'x' ); ylabel( 'pdf' );
title( 'Probability Density Function' );

subplot( 1,3,2 );
plot( x , c , 'k-' );
xlabel( 'x' ); ylabel( 'cdf' );
title( 'Cumulative Density Function' );

% (some lines of the original listing are omitted here)

subplot( 1,3,3 );
hist( y , 20 );
xlabel( 'x' ); ylabel( 'frequency' );
title( 'Histogram of random values' );
[Figure 1.2: Illustration of the Binomial(N, θ) distribution where N = 10 and θ = 0.7, showing the probability mass function, the cumulative probability, and a histogram of random samples.]
Table 1.2: Probability of digits observed in human random digit generation experiment. The
generated digit is represented by X; p(X) and F (X) are the probability mass and cumulative
probabilities, respectively. The data were estimated from subject 6, session 1, in an experiment
by Treisman and Faulkner (1987).
X p(X) F (X)
0 0.000 0.000
1 0.100 0.100
2 0.090 0.190
3 0.095 0.285
4 0.200 0.485
5 0.175 0.660
6 0.190 0.850
7 0.050 0.900
8 0.100 1.000
9 0.000 1.000
Suppose we now want to mimic this process and write an algorithm that samples digits
according to the probabilities shown in Table 1.2. Therefore, the program should produce
a 4 with probability .2, a 5 with probability .175, etc. For example, the code in Listing 1.2
implements this process using the built-in Matlab function randsample. The code produces
the illustration shown in Figure 1.3.
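Listing 1.2 itself is not reproduced in this excerpt; the following is a minimal sketch of how randsample can be used for this purpose, where the vector of probabilities is taken from Table 1.2 and the number of samples is an arbitrary choice.

% Sample digits 0-9 with the probabilities of Table 1.2 using randsample
theta = [0.000 0.100 0.090 0.095 0.200 0.175 0.190 0.050 0.100 0.000];
digitset = 0:9;
K = 10000;                                     % number of digits to sample
Y = randsample( digitset , K , true , theta ); % sample with replacement using weights theta
figure; hist( Y , digitset );                  % histogram of the sampled digits
xlabel( 'Digit' ); ylabel( 'Frequency' );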
Instead of using the built-in functions such as randsample or mnrnd, it is helpful to
consider how to implement the underlying sampling algorithm using the inverse transform
method. We first need to calculate the cumulative probability distribution. In other words,
we need to know the probability that we observe an outcome equal to or smaller than
some particular value. If F(X) represents the cumulative distribution function, we need to calculate
F(x) = p(X ≤ x). For a discrete distribution, this can be done by simple summation.
The cumulative probabilities of our example are shown in the right column of Table 1.2. In
the algorithm, the idea is to sample uniform random deviates (i.e., random numbers between
0 and 1) and to compare each random number against the table of cumulative probabilities.
The sampled outcome is the first outcome for which the cumulative probability equals or
exceeds the random deviate. Figure 1.4 illustrates this for the case where the uniform
random number U = 0.8 leads to a sampled outcome X = 6. This
process of repeated sampling of uniform deviates and comparing these to the cumulative
distribution forms the basis for the inverse transform method for discrete variables. Note
that we are applying an inverse function, because we are doing an inverse table lookup.
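A minimal sketch of this inverse table lookup, using the probabilities of Table 1.2, might look as follows; the loop-based lookup with find is just one of several possible implementations.

% Inverse transform sampling for the discrete distribution in Table 1.2
theta = [0.000 0.100 0.090 0.095 0.200 0.175 0.190 0.050 0.100 0.000];
F = cumsum( theta );                 % cumulative probabilities F(X) for digits 0..9
K = 10000;                           % number of samples to draw
Y = zeros( 1 , K );
for k = 1:K
    u = rand;                        % uniform random deviate
    Y(k) = find( F >= u , 1 ) - 1;   % first digit whose cumulative probability reaches u
end
figure; hist( Y , 0:9 );             % should resemble the probabilities in Table 1.2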
[Figure 1.3: Histogram of the frequencies of digits sampled with the program in Listing 1.2.]
Exercises
1. Create the Matlab program that implements the inverse transform method for discrete
variables. Use it to sample random digits with probabilities as shown in Table 1.2. In
order to show that the algorithm is working, sample a large number of random digits
and create a histogram similar to Figure 1.3. Your program should never sample
digits 0 and 9 as they are given zero probability in the table.
** 2 One solution to the previous exercise that does not require any loops is by using the
multinomial random number generator mnrnd. Can you show how to use this function
to sample digits according to the probabilities shown in Table 1.2?
** 3 Can you explain why the algorithm as described above might be inefficient when dealing
with skewed probability distributions? [hint: imagine a situation where the first N-1
outcomes have zero probability and the last outcome has probability one]. Can you
think of a simple change to the algorithm to improve its efficiency?
Figure 1.4: Illustration of the inverse transform procedure for generating discrete random
variables. Note that we plot the cumulative probabilities for each outcome. If we sample a
uniform random number U = 0.8, then this yields a random value of X = 6.
1. Draw U ∼ Uniform(0, 1)
2. Set X = F⁻¹(U)
3. Repeat
Let’s illustrate this approach with a simple example. Suppose we want to sample random
numbers from the exponential distribution. When λ > 0, the cumulative density function is
F(x|λ) = 1 − exp(−λx). Using some simple algebra, one can find the inverse of this function,
which is F⁻¹(u|λ) = −log(1 − u)/λ. This leads to the following sampling procedure to sample
random numbers from an Exponential(λ) distribution:
1. Draw U ∼ Uniform(0, 1)

2. Set X = −log(1 − U)/λ

3. Repeat
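As a minimal sketch (with λ = 2 chosen only for illustration), this procedure can be vectorized in Matlab as follows.

lambda = 2;                       % rate parameter of the exponential distribution
n = 10000;                        % number of samples to draw
u = rand( 1 , n );                % uniform random deviates
x = -log( 1 - u ) / lambda;       % apply the inverse cdf
figure; hist( x , 50 );           % histogram of the sampled values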
Exercises
1. Implement the inverse transform sampling method for the exponential distribution.
Sample a large number of values from this distribution, and show the distribution of
these values. Compare the distribution you obtain against the exact distribution as
obtained by the PDF of the exponential distribution (use the command exppdf).
** 2 Matlab implements some of its own functions using Matlab code. For example, when
you call the exponential random number generator exprnd, Matlab executes a func-
tion that is stored in its own internal directories. Please locate the Matlab function
exprnd and inspect its contents. How does Matlab implement the sampling from the
exponential distribution? Does it use the inverse transform method? Note that the
path to this Matlab function will depend on your particular Matlab installation, but
it probably looks something like
C:\Program Files\MATLAB\R2009B\toolbox\stats\exprnd.m
Figure 1.5: Sampling points uniformly from unit circle using rejection sampling
Figure 1.6: Illustration of rejection sampling. The particular sample shown in the figure will
be rejected
Figure 1.6 illustrates the procedure. We first need to find a constant c such that cq(θ) ≥
p(θ) for all possible samples θ. The proposal function q(θ) multiplied by the constant c is
known as the comparison distribution and will always lie “on top of” our target distribution.
Finding the constant c might be non-trivial, but let’s assume for now that we can do this
using some calculus. We now draw a number u from a uniform distribution between [0, cq(θ)].
In other words, this is some point on the line segment between 0 and the height of the
comparison distribution evaluated at our proposal θ. We will reject the proposal if u > p(θ)
and accept it otherwise. If we accept the proposal, the sampled value θ is a draw from the
target distribution p(θ). Here is a summary of the computational procedure:
1. Choose a proposal distribution q(θ) that is easy to sample from

2. Find a constant c such that cq(θ) ≥ p(θ) for all θ

3. Sample a proposal θ from q(θ)

4. Sample a uniform deviate u from the interval [0, cq(θ)]

5. Reject the proposal if u > p(θ), accept it otherwise

6. Repeat steps 3, 4, and 5 until the desired number of samples is reached; each accepted
sample is a draw from p(θ)
The key to an efficient operation of this algorithm is to have as many samples accepted
as possible. This depends crucially on the choice of the proposal distribution. A proposal
distribution that is dissimilar to the target distribution will lead to many rejected samples,
slowing the procedure down.
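As a concrete sketch of rejection with a uniform proposal, the following code samples points uniformly from the unit circle, as in Figure 1.5, by proposing from the square [-1,1] x [-1,1] and rejecting any point that falls outside the circle; the number of accepted samples is an arbitrary choice.

n = 5000;                              % number of accepted samples we want
samples = zeros( n , 2 );
count = 0;
while count < n
    xy = -1 + 2 * rand( 1 , 2 );       % uniform proposal inside the square
    if sum( xy.^2 ) <= 1               % accept only points inside the unit circle
        count = count + 1;
        samples( count , : ) = xy;
    end
end
figure; plot( samples(:,1) , samples(:,2) , '.' );
axis square; xlabel( 'x' ); ylabel( 'y' );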
Exercises
1. Suppose we want to sample from a Beta(α, β) distribution where α = 2 and β = 1.
This gives the probability density p(x) = 2x for 0 < x < 1. Implement a rejection
sampling algorithm in Matlab that samples from this distribution. You can use a simple
uniform proposal distribution. The constant c should be 2 in this case. Visualize the
histogram of sampled values and verify that the distribution matches the histogram
obtained by using Matlab’s betarnd sampling function. What is the percentage of
accepted samples? How might we improve the rejection sampler?
** 2 The procedure shown in Figure 1.5 forms the basis for the Box-Muller method for
generating Gaussian distributed random variables. We first generate uniform coordi-
nates (x, y) from the unit circle using the rejection sampling procedure that rejects
any (x, y) pair with x² + y² > 1. Then, for each pair (x, y) we evaluate the quantities
z1 = x [−2 ln(x² + y²)/(x² + y²)]^(1/2) and z2 = y [−2 ln(x² + y²)/(x² + y²)]^(1/2). The values z1 and z2 are each
Gaussian distributed with zero mean and unit variance. Write a Matlab program that
implements this Box-Muller method and verify that the sampled values are Gaussian
distributed.
Chapter 2

Markov Chain Monte Carlo
The application of probabilistic models to data often leads to inference problems that re-
quire the integration of complex, high-dimensional distributions. Markov chain Monte Carlo
(MCMC) is a general computational approach that replaces analytic integration by summa-
tion over samples generated from iterative algorithms. Problems that are intractable using
analytic approaches often become solvable using some form of MCMC, even in high-
dimensional settings. The development of MCMC is arguably the most important advance in the
computational approach to statistics. While MCMC is very much an active research area,
there are now some standardized techniques that are widely used. In this chapter, we will
discuss two forms of MCMC: Metropolis-Hastings and Gibbs sampling. Before we go into
these techniques though, we first need to understand the two main ideas underlying MCMC:
Monte Carlo integration, and Markov chains.
These expectations arise in many situations where we want to calculate some statistic of a
distribution, such as the mean or variance. For example, with g(x) = x, we are calculating
the mean of a distribution. Integration or summation using analytic techniques can become
quite challenging for certain distributions. For example, the density p(x) might have a
functional form that does not lend itself to analytic integration. For discrete distributions,
the outcome space might become so large as to make the explicit summation over all possible
outcomes impractical.
The general idea of Monte Carlo integration is to use samples to approximate the expec-
tation of a complex distribution. Specifically, we obtain a set of samples x(t), t = 1, . . . , n,
drawn independently from the distribution p(x). In this case, we can approximate the expecta-
tions in 2.1 and 2.2 by a finite sum:

E[g(x)] = (1/n) ∑_{t=1}^{n} g(x(t))    (2.3)
In other words, we have now replaced analytic integration with summation over a suitably
large set of samples. Generally, the approximation can be made as accurate
as needed by increasing n. Crucially, the precision of the approximation depends on the
independence of the samples. When the samples are correlated, the effective sample size
decreases. This is not an issue with the rejection sampler discussed in the last chapter, but
a potential problem with MCMC approaches.
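As a minimal sketch of Monte Carlo integration, the following approximates E[g(x)] with g(x) = x^2 for a standard normal distribution, for which the exact value is 1; the number of samples is an arbitrary choice.

n = 100000;               % number of independent samples
x = randn( 1 , n );       % samples from a standard normal distribution
approx = mean( x.^2 );    % finite-sum approximation of E[g(x)]
fprintf( 'Monte Carlo estimate: %5.3f (exact value: 1)\n' , approx );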
Exercises
1. Develop Matlab code to approximate the mean of a Beta(α, β) distribution with α = 3
and β = 4 using Monte Carlo integration. You can use the Matlab function betarnd
to draw samples from a Beta distribution. You can compare your answer with the
analytic solution: α/(α + β). [Note: this can be done with one line of Matlab code].
1. Set t = 1

2. Generate an initial value u, and set x(t) = u
3. Repeat
t=t+1
Sample a new value u from the transition function p(x(t) |x(t−1))
Set x(t) = u
4. Until t = T
Importantly, in this iterative procedure, the next state of the chain at t + 1 is based only
on the previous state at t. Therefore, each Markov chain wanders around the state space,
making moves to new states that depend only on the last state. It is this local
dependency that makes this procedure “Markov” or “memoryless”. As we will see, this is
an important property when using Markov chains for MCMC.
When initializing each Markov chain, the chain will wander in state space around the
starting state. Therefore, if we start a number of chains, each with different initial conditions,
the chains will initially be in a state close to the starting state. This period is called the
burnin. An important property of Markov chains is that the starting state of the chain no
longer affects the state of the chain after a sufficiently long sequence of transitions (assuming
that certain conditions about the Markov chain are met). At this point, the chain is said
to reach its steady state and the states reflect samples from its stationary distribution. This
property that Markov chains converge to a stationary distribution regardless of where we
started (if certain regularity conditions of the transition function are met) is quite important.
When applied to MCMC, it allows us to draw samples from a distribution using a sequential
procedure in which the starting state of the sequence does not affect the estimation process.
Example. Figure 2.1 show an example of a Markov chain involving a (single) continuous
variable x. For the transition function, samples were taken from a Beta(200(0.9x(t−1) +
0.05), 200(1 − 0.9x(t−1) − 0.05)) distribution. This function and its constants are chosen
somewhat arbitrarily, but help to illustrate some basic aspects of Markov chains. The process
was started with four different initial values, and each chain was continued for T = 1000
iterations. The two panels of the Figure show the sequence of states at two different time
scales. The line colors represent the four different chains. Note that the first 10 iterations
or so show a dependence of the sequence on the initial state. This is the burnin period.
This is followed by the steady state for the remainder of the sequence (the chain would
continue in the steady state if we didn’t stop it). How do we know exactly when the steady
state has been reached and the chain converges? This is often not easy to tell, especially in
high-dimensional state spaces. We will defer the discussion of convergence until later.
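A minimal sketch of this example chain, with four chains each run for T = 1000 iterations from random starting points, might look as follows.

T = 1000; K = 4;                             % number of iterations and number of chains
X = zeros( T , K );
X( 1 , : ) = rand( 1 , K );                  % random initial states in [0,1]
for t = 2:T
    for k = 1:K
        a = 200 * ( 0.9 * X(t-1,k) + 0.05 );        % first Beta parameter of the transition function
        b = 200 * ( 1 - 0.9 * X(t-1,k) - 0.05 );    % second Beta parameter
        X( t , k ) = betarnd( a , b );              % sample the next state
    end
end
figure; plot( X ); xlabel( 't' ); ylabel( 'x' );    % each color is one chain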
Exercises
1. Develop Matlab code to implement the Markov chain as described in the example.
Create an illustration similar to one of the panels in Figure 2.1. Start the Markov
Figure 2.1: Illustration of a Markov chain starting with four different initial conditions. The
right and left panels show the sequence of states at different temporal scales.
chain with four different initial values uniformly drawn from [0,1]. [tip: if X is a T x
K matrix in Matlab such that X(t, k) stores the state of the k th Markov chain at the
tth iteration, the command plot(X) will simultaneously display the K sequences in
different colors].
where θ(t) represents the state of a Markov chain at iteration t. The samples from the
chain, after burnin, start to reflect samples from the target distribution p(θ).
In the Metropolis procedure, we initialize the first state, θ(1) to some initial value. We
then use a proposal distribution q(θ|θ(t−1)) to generate a candidate point θ∗ that is conditional
on the previous state of the sampler. The next step is to either accept the proposal or reject
it. The probability of accepting the proposal is:
α = min(1, p(θ∗)/p(θ(t−1)))    (2.6)
To make a decision on whether to actually accept or reject the proposal, we generate a
uniform deviate u. If u ≤ α, we accept the proposal and the next state is set equal to the
proposal: θ(t) = θ∗ . If u > α, we reject the proposal, and the next state is set equal to the
old state: θ(t) = θ(t−1) . We continue generating new proposals conditional on the current
state of the sampler, and either accept or reject the proposals. This procedure continues
until the sampler reaches convergence. At this point, the samples θ(t) reflect samples from
the target distribution p(θ). Here is a summary of the steps of the Metropolis sampler:
1. Set t = 1

2. Generate an initial value u, and set θ(t) = u

3. Repeat

t = t + 1

Generate a proposal θ∗ from q(θ|θ(t−1))

Evaluate the acceptance probability α = min(1, p(θ∗)/p(θ(t−1)))

Generate a u from a Uniform(0,1) distribution

If u ≤ α, accept the proposal and set θ(t) = θ∗; otherwise set θ(t) = θ(t−1)

4. Until t = T
Figure 2.2 illustrates the procedure for a sequence of two states. To intuitively understand
why the process leads to samples from the target distribution, note that 2.6 will always accept
a new proposal if the new proposal is more likely under the target distribution than the
old state. Therefore, the sampler will move towards the regions of the state space where the
target function has high density. However, note that if the new proposal is less likely
than the current state, it is still possible to accept this “worse” proposal and move toward
it. This process of always accepting a “good” proposal, and occasionally accepting a “bad”
proposal, ensures that the sampler explores the whole state space, and samples from all parts
of a distribution (including the tails).
A key requirement for the Metropolis sampler is that the proposal distribution is symmet-
ric, such that q(θ = θ(t)|θ(t−1) ) = q(θ = θ(t−1)|θ(t) ). Therefore, the probability of proposing
some new state given the old state, is the same as proposing to go from the new state back to
the old state. This symmetry holds with proposal distributions such as the Normal, Cauchy,
Student-t, as well as uniform distributions. If this symmetry does not hold, you should use
a Metropolis-Hastings sampler discussed in the next section.
A major advantage of the Metropolis sampler is that Equation 2.6 involves only a ratio of
densities. Therefore, any terms independent of θ in the functional form of p(θ) will drop out.
Therefore, we do not need to know the normalizing constant of the density or probability mass
function. The fact that this procedure allows us to sample from unnormalized distributions is
one of its major attractions. Sampling from unnormalized distributions frequently happens
in Bayesian models, where calculating the normalization constant is difficult or impractical.
Example 1. Suppose we wish to generate random samples from the Cauchy distribution
(note that there are better ways to sample from the Cauchy that do not rely on MCMC, but
we just use it as an illustration of the technique). The probability density of the Cauchy is
given by:
f(θ) = 1/(π(1 + θ²))    (2.7)
Note that because we do not need any normalizing constants in the Metropolis sampler, we
can rewrite this to:
f(θ) ∝ 1/(1 + θ²)    (2.8)
Therefore, the Metropolis acceptance probability becomes
α = min(1, (1 + [θ(t)]²)/(1 + [θ∗]²))    (2.9)
We will use the Normal distribution as the proposal distribution. Our proposals are generated
from a Normal(θ(t), σ) distribution. Therefore, the mean of the distribution is centered on
the current state and the parameter σ, which needs to be set by the modeler, controls the
variability of the proposed steps. This is an important parameter that we will investigate
in the Exercises. Listing 2.1 shows the Matlab function that returns the unnormalized
density of the Cauchy distribution. Listing 2.2 shows Matlab code that implements the
Metropolis sampler. Figure 2.3 shows the simulation results for a single chain run for 500
iterations. The upper panel shows the theoretical density in the dashed red line and the
histogram shows the distribution of all 500 samples. The lower panel shows the sequence of
samples of one chain.
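Listings 2.1 and 2.2 are not reproduced in this excerpt; the following is a minimal sketch of such a sampler, where the proposal standard deviation σ = 1 and the starting state are arbitrary choices.

cauchy = @(theta) 1 ./ ( 1 + theta.^2 );   % unnormalized Cauchy density
T = 500;                                   % number of iterations
sigma = 1;                                 % standard deviation of the proposal distribution
theta = zeros( 1 , T );
theta(1) = 0;                              % initial state
for t = 2:T
    theta_star = normrnd( theta(t-1) , sigma );                  % propose a new state
    alpha = min( 1 , cauchy(theta_star) / cauchy(theta(t-1)) );  % acceptance probability
    if rand <= alpha
        theta(t) = theta_star;             % accept the proposal
    else
        theta(t) = theta(t-1);             % reject and keep the old state
    end
end
figure; subplot( 2,1,1 ); hist( theta , 30 ); subplot( 2,1,2 ); plot( theta );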
Exercises
1. Currently, the program in Listing 2.2 takes all states from the chain as samples to
approximate the target distribution. Therefore, it also includes samples while the
chain is still “burning in”. Why is this not a good idea? Can you modify the code
such that the effect of burnin is removed?
2. Explore the effect of different starting conditions. For example, what happens when
we start the chain with θ = −30?
Figure 2.2: Illustration of the Metropolis sampler to sample from target density p(θ). (A)
the current state of the chain is θ(t). (B) a proposal distribution around the current state is
used to generate a proposal θ∗ . (C) the proposal was accepted and the new state is set equal
to the proposal, and the proposal distribution now centers on the new state.
3. Calculate the proportion of samples that is accepted on average. Explore the effect of
parameter σ on the average acceptance rate. Can you explain what is happening with
the accuracy of the reconstructed distribution when σ is varied?
4. As a followup to the previous question, what is (roughly) the value of σ that leads
to a 50% acceptance rate? It turns out that this is the acceptance rate for which the
Metropolis sampler, in the case of Gaussian distributions, converges most quickly to
the target distribution.
** 6 Modify the code such that the sequences of multiple chains (each initialized differently)
are visualized simultaneously.
Figure 2.3: Simulation results where 500 samples were drawn from the Cauchy distribution
using the Metropolis sampler. The upper panel shows the theoretical density in the dashed
red line and the histogram shows the distribution of the samples. The lower panel shows the
sequence of samples of one chain
1. Set t = 1

2. Generate an initial value u, and set θ(t) = u

3. Repeat

t = t + 1

Generate a proposal θ∗ from q(θ|θ(t−1))

Evaluate the acceptance probability α = min(1, [p(θ∗)/p(θ(t−1))] [q(θ(t−1)|θ∗)/q(θ∗|θ(t−1))])

Generate a u from a Uniform(0,1) distribution

If u ≤ α, accept the proposal and set θ(t) = θ∗; otherwise set θ(t) = θ(t−1)

4. Until t = T
The fact that asymmetric proposal distributions can be used allows the Metropolis-
Hastings procedure to sample from target distributions that are defined on a limited range
(other than the uniform distribution, for which the Metropolis sampler can be used). With bounded vari-
ables, care should be taken in constructing a suitable proposal distribution. Generally, a
good rule is to use a proposal distribution that has positive density on the same support as the
target distribution. For example, if the target distribution has support over 0 ≤ θ < ∞, the
proposal distribution should have the same support.
Exercise
1. Suppose a researcher investigates response times in an experiment and finds that the
Weibull(a, b) distribution with a = 2, and b = 1.9 captures the observed variability
in response times. Write a Matlab program that implements the Metropolis-Hastings
sampler in order to sample response times from this distribution. The pdf for the
Weibull is given by the Matlab command wblpdf. Create a figure that is analogous
to Figure 2.3. You could use a number of different proposal distributions in this case.
For this exercise, use samples from a Gamma(θ(t)τ, 1/τ ) distribution. This proposal
density has a mean equal to θ(t) so it is “centered” on the current state. The parameter
τ controls the acceptance rate of the sampler – it is a precision parameter such that
higher values are associated with less variability in the proposal distribution. Can you
find a value for τ to get (roughly) an acceptance rate of 50%? Calculate the variance of
this distribution using the Monte Carlo approach with the samples obtained from the
Metropolis-Hastings sampler. If you would like to know how close your approximation
is, use online resources to find the analytically derived answer.
1. Set t = 1

2. Generate an initial value u = (u1, u2, . . . , uN), and set θ(t) = u

3. Repeat

t = t + 1

Generate a proposal θ∗ from q(θ|θ(t−1))

Evaluate the acceptance probability α = min(1, [p(θ∗)/p(θ(t−1))] [q(θ(t−1)|θ∗)/q(θ∗|θ(t−1))])

Generate a u from a Uniform(0,1) distribution

If u ≤ α, accept the proposal and set θ(t) = θ∗; otherwise set θ(t) = θ(t−1)

4. Until t = T
Example 1 (adapted from Gill, 2008). Suppose we want to sample from the bivariate
exponential distribution
p(θ1 , θ2) = exp (−(λ1 + λ)θ1 − (λ2 + λ)θ2 − λmax(θ1 , θ2)) (2.11)
For our example, we will restrict the range of θ1 and θ2 to [0,8] and set the constants to the
following: λ1 = 0.5, λ2 = 0.1, λ = 0.01, max(θ1, θ2) = 8. This bivariate density is visualized
in Figure 2.4, right panel. The Matlab function that implements this density function is
shown in Listing 2.3. To illustrate the blockwise MH sampler, we use a uniform proposal
distribution, where proposals for θ1∗ and θ2∗ are sampled from a Uniform(0,8) distribution. In
other words, we sample proposals for θ ∗ uniformly from within a box. Note that with this
particular proposal distribution, we are not conditioning our proposals on the previous state
of the sampler. This is known as an independence sampler. This is actually a very poor
proposal distribution but leads to a simple implementation because the ratio q(θ(t−1)|θ∗)/q(θ∗|θ(t−1)) = 1
and therefore disappears from the acceptance ratio. The Matlab code that implements the
sampler is shown in Listing 2.4. Figure 2.4, left panel shows the approximated distribution
using 5000 samples.
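Listings 2.3 and 2.4 are not included here; a minimal sketch of the density function and the blockwise independence sampler is given below. Note that the term max(θ1, θ2) is fixed at 8, following the constants stated in the text; with λ = 0.01 this term is effectively constant anyway.

lambda1 = 0.5; lambda2 = 0.1; lambda = 0.01; maxval = 8;   % constants from the text
bivexp = @(th) exp( -(lambda1+lambda)*th(1) - (lambda2+lambda)*th(2) - lambda*maxval );
T = 5000;
theta = zeros( T , 2 );
theta( 1 , : ) = rand( 1 , 2 ) * 8;            % initial state inside the box [0,8] x [0,8]
for t = 2:T
    theta_star = rand( 1 , 2 ) * 8;            % independence proposal from Uniform(0,8)
    alpha = min( 1 , bivexp(theta_star) / bivexp(theta(t-1,:)) );
    if rand <= alpha
        theta( t , : ) = theta_star;           % accept
    else
        theta( t , : ) = theta( t-1 , : );     % reject
    end
end
figure; plot( theta(:,1) , theta(:,2) , '.' ); xlabel( '\theta_1' ); ylabel( '\theta_2' );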
Example 2. Many researchers have proposed probabilistic models for order information.
Order information can relate to preference rankings over political candidates, car brands
Figure 2.4: Results of a Metropolis sampler for the bivariate exponential (right) approxi-
mated with 5000 samples (left)
and ice cream flavors, but can also relate to knowledge about the relative order of items
along some temporal or physical dimension. For example, suppose we ask individuals to
remember the chronological order of US presidents. Steyvers, Lee, Miller, and Hemmer
(2009) found that individuals make a number of mistakes in the ordering of presidents that
can be captured by simple probabilistic models, such as Mallows model. To explain Mallows
model, let’s say that we are looking at the first five presidents: Washington, Adams, Jefferson,
Madison, and Monroe. We will represent this true ordering by a vector ω = (1, 2, 3, 4, 5) =
(Washington, Adams, Jefferson, Madison, Monroe). Mallows model now proposes that
the remembered orderings tend to be similar to the true ordering, with very similar orderings
being more likely than dissimilar orderings. Specifically, according to Mallows model, the
probability that an individual remembers an ordering θ is proportional to:
p(θ|ω, λ) ∝ exp (−d(θ, ω)λ) (2.12)
In this equation, d(θ, ω) is the Kendall tau distance between two orderings. This distance
measures the number of adjacent pairwise swaps that are needed to bring the two orderings
into alignment. For example, if θ = (Adams, Washington, Jefferson, Madison, Monroe),
then d(θ, ω) = 1 because one swap is needed to make the two orderings identical. Similarly,
if θ = (Adams, Jefferson, Washington, Madison, Monroe), then d(θ, ω) = 2 because two
swaps are needed. Note that in Kendall tau distance, only adjacent items can be swapped.
The scaling parameter λ controls how sharply peaked the distribution of remembered order-
ings is around the true ordering. Therefore, by increasing λ, the model makes it more likely
that the correct ordering (or something similar) will be produced.
The problem is now to generate orderings θ according to Mallows model, given the true
ordering ω and scaling parameter λ. This can be achieved in very simple ways using a
Metropolis sampler. We start the sampler with θ (1) corresponding to a random permutation
of items. At each iteration, we then make proposals θ ∗ that slightly modify the current state.
This can be done in a number of ways. The idea here is to use a proposal distribution where
the current ordering is permuted by transposing any randomly chosen pair of items (and not
just adjacent items). Formally, we draw proposals θ ∗ from the proposal distribution
q(θ = θ∗|θ(t−1)) = 1/(N choose 2) if S(θ∗, θ(t−1)) = 1, and 0 otherwise    (2.13)
where S(θ ∗, θ (t−1)) is the Cayley distance. This distance counts the number of transpositions
of any pair of items needed to bring two orderings into alignment (therefore, the difference
with the Kendall tau distance is that any pairwise swap counts as one, even nonadjacent
swaps). This is just a complicated way to describe a very simple proposal distribution:
just swap two randomly chosen items from the last ordering, and make that the proposed
ordering.
Because the proposal distribution is symmetric, we can use the Metropolis sampler. The
acceptance probability is
α = min(1, p(θ∗|ω, λ)/p(θ(t−1)|ω, λ)) = min(1, exp(−d(θ∗, ω)λ)/exp(−d(θ(t−1), ω)λ)).    (2.14)
The Matlab implementation of the Kendall tau distance function is shown in Listing 2.5.
The Matlab code for the Metropolis sampler is shown in Listing 2.6. Currently, the code
does not do all that much. It simply shows what the state is every 10 iterations for a total
of 500 iterations. Here is some sample output from the program:
For example, suppose we have a bivariate distribution θ = (θ1, θ2). We first initialize
the sampler with some suitable values for θ1(1) and θ2(1). At each iteration t, we first make
a proposal θ1∗ depending on the last state θ1(t−1). We then evaluate the acceptance ratio
comparing the likelihood of (θ1∗, θ2(t−1)) against (θ1(t−1), θ2(t−1)). Note that in this proposal,
we have only varied the first component and kept the second component constant. In the
next step, we make a proposal θ2∗ depending on the last state θ2(t−1). We then evaluate the
acceptance ratio comparing the likelihood of (θ1(t), θ2∗) against (θ1(t), θ2(t−1)). Importantly, in
this second step, we are holding the first component constant, but this is the updated value
from the first step. Therefore, what happens in the second step is conditional on what
happens in the first step. Here is a summary of the steps for a bivariate componentwise MH
sampler:
1. Set t = 1

2. Generate an initial value u = (u1, u2), and set θ(t) = u

3. Repeat

t = t + 1

Generate a proposal θ1∗ from q(θ1|θ1(t−1))

Evaluate the acceptance probability α = min(1, [p(θ1∗, θ2(t−1))/p(θ1(t−1), θ2(t−1))] [q(θ1(t−1)|θ1∗)/q(θ1∗|θ1(t−1))])

Generate a u from a Uniform(0,1) distribution

If u ≤ α, accept the proposal and set θ1(t) = θ1∗; otherwise set θ1(t) = θ1(t−1)

Generate a proposal θ2∗ from q(θ2|θ2(t−1))

Evaluate the acceptance probability α = min(1, [p(θ1(t), θ2∗)/p(θ1(t), θ2(t−1))] [q(θ2(t−1)|θ2∗)/q(θ2∗|θ2(t−1))])

Generate a u from a Uniform(0,1) distribution

If u ≤ α, accept the proposal and set θ2(t) = θ2∗; otherwise set θ2(t) = θ2(t−1)

4. Until t = T
around the old value of each component and has standard deviation σ. The code in Listing
2.7 shows the Matlab implementation for this sampler. The output of this code is shown in
Figure 2.5.
Figure 2.5: Results of the componentwise Metropolis sampler for the bivariate normal dis-
tribution. Left panel shows the approximation using 5000 samples. Right panel shows the
theoretical density.
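Listing 2.7 is not reproduced here; the following is a minimal sketch of a componentwise sampler for a bivariate normal target, where the mean, covariance, and proposal standard deviation are illustrative assumptions rather than the values used in the book.

mu = [ 0 0 ]; C = [ 1 0.3 ; 0.3 1 ];        % illustrative mean and covariance
target = @(th) mvnpdf( th , mu , C );       % bivariate normal target density
T = 5000; sigma = 1;                        % iterations and proposal standard deviation
theta = zeros( T , 2 );
for t = 2:T
    theta( t , : ) = theta( t-1 , : );
    for d = 1:2                             % update each component in turn
        prop = theta( t , : );
        prop( d ) = normrnd( theta(t,d) , sigma );             % propose a change to component d
        alpha = min( 1 , target(prop) / target(theta(t,:)) );  % componentwise acceptance probability
        if rand <= alpha
            theta( t , : ) = prop;          % accept the componentwise proposal
        end
    end
end
figure; plot( theta(:,1) , theta(:,2) , '.' );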
Exercises
1. Modify the code for Example 1 to implement a component-wise sampler. Use a uniform
proposal distribution for both θ1 and θ2 . Visualize the distribution in the same way as
Figure 2.4. Calculate the acceptance probability separately for θ1 and θ2 . Why is the
acceptance probability for θ2 higher than for θ1 ?
2. Investigate the effect of the scaling parameter λ in Mallows model. How do the sampled
orders change as a function of this parameter?
3. Modify the Metropolis sampler for Mallows model and sample from a 10- as opposed
to 5-dimensional space of orderings.
** 4 The purpose of this exercise is to compare the approximated distribution under Mallows
model with the exact distribution. Note that Equation 2.12 showed the probability for
Mallows model but dropped the constant of proportionality. The exact probability
under Mallows model is p(θ|ω, λ) = (1/ψ(λ)) exp(−d(θ, ω)λ), where ψ(λ) is a normalizing
% Our proposal is the last ordering but with two items switched
lasttheta = theta( : , t-1 );      % Get the last theta
% Propose two items to switch
whswap = randperm( L ); whswap = whswap( 1:2 );
theta_star = lasttheta;
theta_star( whswap(1) ) = lasttheta( whswap(2) );
theta_star( whswap(2) ) = lasttheta( whswap(1) );
Figure 2.6: Results of a Gibbs sampler for the bivariate normal distribution. Left panel
shows the progression of the sampler state for the first 20 iterations. Right panel shows a
scatterplot of 5000 sampled values.
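The text of Example 1 for the Gibbs sampler is not included in this excerpt; the sketch below assumes a bivariate normal with means (0, 0), unit variances, and correlation ρ, for which each full conditional is Normal(ρ times the other variable, sqrt(1 − ρ²)); this is a standard result rather than something taken from the excerpt.

rho = 0.7; T = 5000;
theta = zeros( T , 2 );                  % start both components at zero
for t = 2:T
    theta( t , 1 ) = normrnd( rho * theta(t-1,2) , sqrt(1-rho^2) );  % sample theta1 given theta2
    theta( t , 2 ) = normrnd( rho * theta(t,1)   , sqrt(1-rho^2) );  % sample theta2 given the updated theta1
end
figure; plot( theta(:,1) , theta(:,2) , '.' ); xlabel( '\theta_1' ); ylabel( '\theta_2' );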
Exercises
1. Implement the Gibbs sampler from Example 1 in Matlab and produce an illustration
similar to Figure 2.6. You can use the parameters µ = (0, 0) and ρ = 0.7. What
happens when ρ = −0.7?
2. Compare the output you get from the Gibbs sampler with random samples that are
drawn from Matlab’s own multivariate normal random number generator (using com-
mand mvnrnd).
Chapter 3
Graphical Models
Graphical models are a general class of probabilistic models that allow researchers to reason
with uncertain data in the presence of latent variables. Latent variables in cognitive science
can include any information that is unobservable, such as attentional states, knowledge
representations, contents of memory, and brain states. In short, anything that is relevant
to explain the observed data but was itself not observed, can become a latent variable in a
graphical model.
In a graphical model, the researcher posits certain local statistical dependencies between
the random variables that correspond to the latent variables and observed data. We will
restrict ourselves to directed graphical models, in which the statistical dependency between
random variables is based on directional relationships. These models are also known as
Bayesian networks and belief networks, although these terms all have somewhat different
connotations. Another class of graphical models, not discussed here, are undirected graphical
models, such as Markov random fields. These models are useful when it is natural to express
the relationship between variables using undirected relationships.
It is important to realize that graphical models form a very large class of models. Many
models that are typically not described as graphical models can be reinterpreted within a
graphical modeling framework. For example, many forms of neural networks can be usefully
analyzed with graphical models. Similarly, many process models proposed by cognitive
psychologists are framed as stochastic processes that can be couched as graphical models.
In this chapter, we will first review some basic rules and concepts related to probability
theory. We will then go into the important concepts of graphical models that are needed to
solve inference problems.
resources.
In the sum rule, we can write the marginal probability of x as a sum over the joint
distribution of x and y where we sum over all possibilities of y,
p(x) = Σ_y p(x, y)    (3.1)
In the product rule, we can rewrite a joint distribution as a product of a conditional and
marginal probability
p(x, y) = p(y|x)p(x) = p(x|y)p(y).    (3.2)
In the chain rule, the product rule is applied repeatedly to give expressions for the joint
probability involving more than two variables. For example, the joint distribution over three
variables can be factorized into a product of conditional probabilities:

p(x, y, z) = p(z|x, y)p(y|x)p(x)
From these rules, we can obtain Bayes rule, which plays a central role in graphical models:
p(y|x) = p(x|y)p(y)/p(x) = p(x|y)p(y) / Σ_{y′} p(x|y′)p(y′)    (3.4)
Figure 3.1: A graphical model that illustrates the conditional dependencies between vari-
ables.
In other words, if we condition on z, and now also learn about y, this is not going to change
the probability of x. It is important to realize that conditional independence between x and
y does not imply independence between x and y. We will see some examples of the difference
between independence and conditional independence in the context of graphical models.
Exercises
1. Use the chain rule to give the general factorization of a joint distribution involving 4
variables
alarm. Finally, while you are traveling in your car, you are also listening to the news on the
radio and presumably, if there was an earthquake, you would probably hear about it on the
radio. Suppose you now get a call from your neighbor, what is the probability that there is
a burglar in your house?
This example sets up a number of issues on how we should reason with uncertainty when
there are multiple variables involved. To model this situation, we can use a graphical model
as shown in Figure 3.1. There are five random variables in this example: earthquake, burglar,
alarm, neighbor call and radio report. For simplicity, let’s assume these are binary variables;
either the alarm goes off or it doesn’t, the neighbor calls, or she doesn’t, etc. The illustration
also shows how we can replace these words by letter codes (e.g., e, b, etc).
In the graphical model, the nodes are random variables and arrows indicate conditional
dependencies between variables. For example, the network encodes the intuition that the sta-
tus of the alarm depends on earthquakes and burglars, and that the neighbor’s call depends
on the alarm. It is useful to think of these conditional dependencies as causal relationships
between variables – the burglar might cause the alarm to go off, which in turn might cause
your neighbor to call. This is how Judea Pearl originally thought of graphical models – as
ways to reason probabilistically about causes and effects. However, you should keep in mind
that graphical models can also be constructed in the absence of any causal interpretation.
The most important idea behind graphical models is that the graph structure implies a
number of independence and conditional independence relations. For example, the illustra-
tion shows that we have assumed that earthquakes are independent of burglaries – one event
doesn’t cause the other event. In other words, we have implied p(e, b) = p(e)p(b). Note that
this assumption might not be true in the real world, but this is assumed in the network.
Also, we have assumed that it is the alarm that might cause the neighbor to call, and not the
burglary itself or an earthquake event. This implies a conditional independence relationship
– we have assumed that the neighbor call is independent of the burglary event, conditional
on knowing something about the alarm. In other words, we have assumed that if one already
knows about the alarm, knowing about the burglar will not change the probability of the
neighbor calling. Specifically, the network implies the following conditional independence
relationship: p(c|a, b) = p(c|a).
Figure 3.2: The conditional probability tables (CPTs) for the graphical model example.
might still go off even in the absence of earthquakes or burglaries (with probability 0.01).
Furthermore, we have assumed that an earthquake by itself might trigger the alarm (with
probability 0.29), that a burglary by itself more reliably triggers the alarm (with probability
0.94) and that the combination of an earthquake and burglary raises the probability to 0.98.
Furthermore, we have assumed that our neighbor is likely to call (with probability 0.90) if
the alarm goes off, but not always. In addition, with probability 0.05, the neighbor can call
even if there is no alarm (the neighbor might be calling about something else).
With all the conditional probabilities specified, we can now engage in probabilistic infer-
ence. We can calculate the probability of one event if we know about the outcomes of other
events. We can reason from causes to effects but also from effects to causes. Because our
example is so simple, we can use exact methods to calculate any probability of interest, just
by using some standard rules in probability theory, such as the sum, product and Bayes rule.
For example, suppose we want to calculate the probability that the alarm is triggered
knowing that the burglary is taking place. In other words, we need to know p(a = 1|b = 1).
Note that in this question, we have not said anything about earthquakes, another potential
trigger of the alarm, and therefore we need to sum over all possibilities regarding earthquakes.
Using the sum and product rules, we can calculate this by

p(a = 1|b = 1) = p(a = 1|b = 1, e = 0)p(e = 0) + p(a = 1|b = 1, e = 1)p(e = 1)
= (0.94)(0.998) + (0.98)(0.002)
= 0.94008.
Suppose we hear a radio report on an earthquake, what is the probability that an earthquake
happened, i.e., p(e = 1|r = 1)? Note that this probability is not 1 because we have assumed
that earthquakes are likely but not guaranteed to be reported. Note also that we cannot
directly read off this probability from the conditional probability table because the table
shows the probability of a report conditional on an earthquake and not the other way around.
In this case, we can use Bayes rule to “flip” the conditional probabilities as follows:
Exercises
1. Calculate the probability that the alarm is not triggered if a burglary is taking place:
p(a = 0|b = 1). For this and the following exercises, show how you derived the answer,
and do not just give a single numeric answer.
2. Calculate the probability that you get a call from your neighbor if a burglary is taking
place: p(c = 1|b = 1). Note that the answer from the previous exercise will be part of
this calculation.
3. Calculate the probability that there is a burglary if the alarm is triggered: p(b = 1|a =
1).
** 5 Calculate the probability that there is a burglary if you get a call from your neighbor:
p(b = 1|c = 1).
Exercises
1. Calculate the probabilities p(b = 1|a = 1, e = 1) and p(b = 1|a = 1). Note that the
last probability was already calculated in a previous exercise. Verify that the first
probability is much lower than the second probability.
where pa_k represents the realizations of variables that are parents of the kth node in the
network. Therefore, the joint distribution can be factorized as a product of the probability of each
node conditional on its parents. Note that if a node k has no parents (it is a root node), the
conditional probability p(xk|pa_k) becomes p(xk). We can turn a particular factorization of a
joint distribution into a graphical representation. The key here is that sometimes, it is useful
to visually inspect the graph structure to understand the properties of a graphical model.
At other times, it is useful to inspect the implied factorization of the joint distribution.
The directed graphs we are considering are called directed acyclic graphs or DAGs. These
are subject to an important restriction: there cannot be any directed cycles in the graph,
whereby following the arrows from one node leads us back to the same node. In other
words, there cannot be any (directed) loops in the graph.
For the alarm example, the joint distribution factorizes as follows:

p(a, b, c, e, r) = p(c|a)p(a|e, b)p(r|e)p(e)p(b)
Therefore, the probability of a particular set of outcomes can easily be evaluated. If we want
to know the probability of getting the neighbor's call while the alarm is triggered and there is
an earthquake but no burglary and there is an earthquake report on the radio, we can write
p(a = 1, b = 0,c = 1, e = 1, r = 1)
= p(c = 1|a = 1)p(a = 1|e = 1, b = 0)p(r = 1|e = 1)p(e = 1)p(b = 0)
= (0.90)(0.29)(0.992)(0.002)(0.999)
= 0.0005173.
For small networks, it is relatively easy to just visually read off the independence relation-
ships from a network. Instead of visual inspection, one can also use the joint distribution
to derive all the conditional independence relations. For example, consider the graphical
model in Figure 3.3, panel C. Suppose we want to know whether x and y are independent
[Figure 3.3: Four example graphical models, panels (A) through (D), relating the variables x, y, and z.]
conditional on z, such that p(x, y|z) = p(x|z)p(y|z). We can use the product rule to write
the conditional probability p(x, y|z) as follows:
p(x, y|z) = p(x, y, z)/p(z)
We can replace the joint distribution in the numerator by the factorization that the network
implies:
p(x, y|z) = p(x, y, z)/p(z) = p(x|z)p(y|z)p(z)/p(z) = p(x|z)p(y|z)
Therefore, the conditional independence holds in this case.
Exercises
1. Consider the graphical model in Figure 3.3, panel A. Is this a directed acyclic graph?
Give the factorization of the joint probability.
2. For the graphical models B and D in Figure 3.3, determine whether x and y are conditionally
independent given z. Also, in what cases are x and y independent?
Figure 3.4: Examples of graphical models where shaded and unshaded nodes indicate ob-
served and unobserved variables respectively. The models in panels A and B are identical.
The plate notation in panel B provides a more compact representation of models with re-
peated sampling steps. Panel C illustrates a deterministic variable as indicated by the
double-bordered node.
Irvine, California”. Although we as experimenters know the true answer to each question,
the goal of the modeling is to estimate the true answer µ on the basis of all the estimates
given by individuals. This setup is similar to Cultural Consensus models developed by
Batchelder and colleagues, except that we apply this to questions with continuous as opposed
to discrete answers (such as true/false or multiple choice questions). In this consensus
model, we are assuming that each individual is giving N repeated estimates for the same
question. Therefore, the observed data consists of estimates yij where i = 1, . . . , N indexes
the repetition and j = 1, . . . , M indexes the individual. In the model for this task, we assume
that all estimates are Normally distributed centered around the mean µ, which is a latent
variable in the model. We also assume that each individual is associated with a separate
standard deviation σj . This allows the model to associate different levels of uncertainty for
each individual. To complete the model, we put a Gamma(a,b) prior on τj , which is the
precision associated with individual j. We map from precision to standard deviation simply
by σj = 1/√τj. This leads to the following model:

yij ∼ Norm(µ, σj)
σj = 1/√τj    (3.11)
τj ∼ Gamma(a, b)
where a and b are hyperparameters in the model. The associated graphical model is shown in
Figure 3.5. Note that there are two plates in the model. The inner plate indicates the repeated
samples for a given individual, and the outer plate indicates the sampling across individuals.
The fact that plates can be nested is quite helpful to create compact representations for
hierarchical models.
[Figure 3.5: Graphical model for the consensus model, with an inner plate over the repetitions i = 1, . . . , N and an outer plate over the individuals j = 1, . . . , M.]
Exercises
1. Suppose in the model in Figure 3.4, panel A, the experimenter lost the observed value
y3 because of a programming error. How would this change the graphical model?
2. Write down the joint distribution p(y1, y2 , y3, y4 , µ, σ) for the model in Figure 3.4, panel
B (equivalent to the model in panel A). Instead of writing one long sequence of terms,
it is helpful to use product notation (i.e., ∏).
3. Write down the joint distribution p(µ, τj, y11, y12, . . . , yN M) for the consensus model in
Figure 3.5. Again, use product notation.
4. For the consensus model in the example, how would the graphical model change if the
means are not latent variables but are observed instead?
5. The consensus model in the example models the situation for just a single question.
Suppose we want to expand the model to K separate questions, where each question is
associated with a single latent truth µk . To simplify the model, we assume that each
individual has an uncertainty σj regardless of what question is asked. Visualize the
corresponding graphical model.
6. Suppose we have the following model:
zi ∼ Bernoulli(1/2)
φ0 = 1/2
φi1 ∼ Gaussian(µ, λ)
µ ∼ Uniform(0.5, 1)
λ ∼ Gamma(0.001, 0.001)
θi = φ0 if zi = 0, and θi = φi1 if zi = 1
ki ∼ Binomial(θi , n)
Write down the corresponding graphical model when ki and n are the only observed
variables, and i = 1, . . . , p. Note that this model was taken from the coursebook by
Lee and Wagenmakers.
Chapter 4

Approximate Inference in Graphical Models
A standard application of graphical models is to find the distributions for the latent variables
that might explain the observed data. In a Bayesian framework, this corresponds to esti-
mating the posterior distribution over the latent variables. These inferred distributions can
be used to test theoretical assumptions about some underlying process (e.g. a cognitive pro-
cess). Another application of graphical models is to produce predictive distributions. This
can be useful to simulate what the model believes a priori the data should look like, or how
the model believes any future data should look like after learning about some observed data.
Such simulations can be useful to the modeler to test the assumptions of some theoretical
framework and to make predictions that can be tested with empirical research.
ð
é ê ë ì í î
ï
ü
ü ü ý þ ÿ
ñ ò ó ô ò õ ò ö ÷ ó ø ù ó ú ö ñ ô û ù ö ò ó ô ò ñ ô û ù ö ò ó ô ò õ ò ö ÷ ó ø ù ó ú ö
Figure 4.1: Three inference problems in graphical models. The question marks indicate the
variables that need to be inferred for each problem.
model. Let θ = (θ1 , . . . , θM ) be the set of latent variables in a model. The prior predictive
distribution corresponds to:
p(y) = ∫ p(y, θ) dθ = ∫ p(θ)p(y|θ) dθ    (4.1)
Therefore, we integrate over all possible combinations of the latent variables and weight
the predictions that flow from the latent variables with the prior probability of the latent
variables. Note that integration is replaced by summation for discrete latent variables in the
network.
In some cases, the integrals in Equation 4.1 can be evaluated analytically. In this section,
we will focus instead on a simple procedure that approximates the prior predictive using
Monte Carlo sampling. This procedure is based on ancestral sampling. The idea is very
simple. To draw one sample from the prior predictive, we start by sampling from the root
nodes in the network (these are the nodes that have no parents). If the root nodes are
constants in the model set by the researcher, we do not sample them but simply keep those values. In
the next step, we draw samples from the child nodes of the root nodes. In this case, we are
drawing from distributions that are conditional on the sampled values of the parent nodes.
On each subsequent step, we draw samples from nodes for which the parent nodes already
were sampled and stop when all variables have an assigned sample. This process of ancestral
sampling repeats for a number of iterations until we have drawn enough samples. The
important aspect in ancestral sampling is to always start at the root nodes at the “top” of
the network, work your way down to the “bottom” of the network, and at each step
only sample from a node whose parent nodes have already been sampled. Note that this method
is just based on Monte Carlo sampling and not Markov chain Monte Carlo. Therefore, there
is no burnin period – all samples drawn in this procedure can be accepted as samples from
the prior predictive distribution.
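As a minimal sketch of ancestral sampling for the consensus model of Equation 3.11, using the constants listed in the caption of Figure 4.2, the code below first samples the root nodes and then the child nodes; whether Matlab's gamrnd should be given b as a scale or a rate parameter is an assumption here.

mu = 150; a = 0.3; b = 3.33; M = 15; N = 20;   % constants from Figure 4.2
tau = gamrnd( a , b , 1 , M );                 % sample the precision of each individual (root nodes)
y = zeros( N , M );
figure; hold on;
for j = 1:M
    sigma_j = 1 / sqrt( tau(j) );              % deterministic mapping from precision to standard deviation
    y( : , j ) = normrnd( mu , sigma_j , N , 1 );   % sample the N estimates of individual j
    plot( j * ones( N , 1 ) , y( : , j ) , 'ko' );
end
xlabel( 'Individual' ); ylabel( 'y_{ij}' );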
Figure 4.2: Data pattern sampled from the prior predictive distribution for the consensus
model with normal distributions. The model used µ = 150, a = 0.3, b = 3.33, M = 15 and
N = 20.
Exercises
1. Consider the alarm network discussed earlier and shown in Figure 3.2. Suppose we want to
calculate the prior predictive distribution for node c, the neighbor call. Use the ances-
tral sampling method to draw 10,000 samples from the prior predictive distribution.
On the basis of these samples, what is the predicted probability, a priori, of a neighbor
call? (** what is the exact probability?).
2. Consider the consensus model described in Equation 3.11. Suppose that µ = 150 and
that the hyperparameters a and b are set to 0.3 and 3.33 respectively. Suppose that the
number of individuals M = 15 and the number of repetitions per individual, N = 20.
Intuitively, this corresponds to a situation where 15 individuals all give 20 independent
estimates that are centered around the true answer of 150. Draw a single sample from
the prior predictive distribution over yij . You can visualize the results as in Figure 4.2.
This figure shows some possible data pattern from the prior predictive (keep in mind
that your results will look different because a different random seed is used). It shows
that some individuals are highly variable in their estimates (corresponding to a high
σj), but all estimates are centered around 150. Save the resulting data pattern to a file.
You’ll need it for a later exercise.
Exercise
1. Consider again the alarm network. Suppose the observed data consists of node c, the
neighbor calling. Suppose we have just heard the phone ringing and it is the neighbor
calling. Use rejection sampling to calculate the posterior probability of a burglary,
p(b = 1|c = 1). Note that you can use the code you developed to draw samples from
the prior predictive distribution. You just have to modify it to reject samples that lead
to the outcome c = 0.
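A minimal sketch of this rejection step, assuming the prior predictive sampler shown earlier stores the sampled burglary values in b and the sampled calls in c, might look like:

keep = (c == 1);                                   % reject all samples with c = 0
fprintf('Estimated p(b=1 | c=1) = %5.4f\n', mean(b(keep)));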
1. Set t = 1
2. Generate initial values v1, v2 for the two latent variables
3. Set the initial state θ1(t) = v1 and θ2(t) = v2
4. Repeat
       t = t + 1
       Sample v1 ∼ p(θ1 | y, θ2 = v2)
       Sample v2 ∼ p(θ2 | y, θ1 = v1)
       Update the state θ1(t) = v1 and θ2(t) = v2
5. Until t = T
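To make the loop structure above concrete, here is a minimal Matlab sketch for a toy case (not the consensus model) in which the two latent variables are assumed to have a bivariate normal distribution with correlation rho, so that each full conditional is itself normal. The value of rho and the absence of observed data are assumptions made purely for illustration.

rho = 0.8;                                 % assumed correlation for the toy example
T   = 1000;                                % number of iterations
theta1 = zeros(T,1); theta2 = zeros(T,1);
theta1(1) = randn; theta2(1) = randn;      % steps 2-3: initial values and state
for t = 2:T                                % step 4: repeat
    % sample v1 ~ p(theta1 | theta2 = v2)
    theta1(t) = normrnd(rho * theta2(t-1), sqrt(1 - rho^2));
    % sample v2 ~ p(theta2 | theta1 = v1)
    theta2(t) = normrnd(rho * theta1(t), sqrt(1 - rho^2));
end                                        % step 5: until t = T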
Markov Blankets
At the heart of the general MCMC approach for posterior inference is the conditional dis-
tribution p(θj |y, θ−j = v−j ) where we sample a value for a latent variable conditioned on
the instantiations of all other latent variables as well as the data. When calculating this
distribution, one can take advantage of the conditional independence relationships in the
graphical model. To illustrate this, look at the example graphical model in Figure 4.3, panel
A. Suppose in our sampler, we are at the step where we need to sample an assignment for the
third latent variable θ3 by sampling from the conditional p(θ3 | θ1, θ2, θ4, θ5, y1, y2). Using the
product rule, we can write this as proportional to the joint distribution
p(θ3 | θ1, θ2, θ4, θ5, y1, y2) ∝ p(θ3, θ1, θ2, θ4, θ5, y1, y2)
We can simplify the joint distribution using the general factorization in Equation 3.10, giving us:
p(θ3 | θ1, θ2, θ4, θ5, y1, y2) ∝ p(θ1) p(θ2) p(θ4) p(θ3 | θ1) p(θ5 | θ2) p(y1 | θ3, θ5) p(y2 | θ5, θ4)
Figure 4.3: Illustration of Markov blankets for an example graphical model (A). The Markov
blankets for θ3 and θ5 are shown in (B) and (C) respectively.
Now, note that we are looking for a distribution for θ3. Any term on the right-hand side
that does not depend on θ3 can be removed from the equation, leading to
p(θ3 | θ1, θ2, θ4, θ5, y1, y2) ∝ p(θ3 | θ1) p(y1 | θ3, θ5)
On the right-hand side, we only have the terms θ3, θ1, θ5, and y1. All other latent variables
and observed data are irrelevant in this conditional distribution. We call the variables that
are relevant for the conditional distribution the Markov blanket. Figure 4.3, panel B shows
the Markov blanket for θ3. We can use similar reasoning to derive the Markov blanket for
θ5, as shown in panel C.
Instead of going through these calculations each time to determine the Markov blanket,
we can use a simple definition. The Markov blanket for a node includes all parents, all
children and all parents of these children.
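As an illustration, the Markov blanket can be read off directly from an adjacency matrix representation of the graph. The sketch below encodes the example network of Figure 4.3, with an assumed node numbering θ1, . . . , θ5, y1, y2 as nodes 1–7, and recovers the blanket of θ3:

% A(i,j) = 1 means node i is a parent of node j; node order (an assumption):
% theta1..theta5 = 1..5, y1 = 6, y2 = 7
A = zeros(7);
A(1,3) = 1; A(2,5) = 1;                      % theta1 -> theta3, theta2 -> theta5
A(3,6) = 1; A(5,6) = 1;                      % theta3 -> y1,     theta5 -> y1
A(5,7) = 1; A(4,7) = 1;                      % theta5 -> y2,     theta4 -> y2
node      = 3;                               % find the Markov blanket of theta3
parents   = find(A(:,node))';                % all parents of the node
children  = find(A(node,:));                 % all children of the node
coparents = [];
for ch = children
    coparents = [coparents find(A(:,ch))'];  % parents of each child
end
blanket = setdiff(unique([parents children coparents]), node)

Running this returns nodes 1, 5, and 6, that is θ1, θ5, and y1, in agreement with panel B of Figure 4.3.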
In the consensus model, for example, we need to sample from a conditional distribution p(µ | y, τ, a, b). Again, using the conditional
independence relations, we can derive that p(µ | y, τ, a, b) = p(µ | y, τ). This leads to the
following general MCMC procedure where the state at iteration t consists of sampled values
τ(t) and µ(t):
1. Set t = 1
2. Generate initial values v1, . . . , vM for the precision variables τ1, . . . , τM
3. Generate an initial value vM+1 for the mean µ
4. Set the initial state τ(t) = (v1, . . . , vM)
5. Set the initial state µ(t) = vM+1
6. Repeat
       t = t + 1
       Sample v1 ∼ p(τ1 | y11, . . . , yN1, µ = vM+1, a, b)
       Sample v2 ∼ p(τ2 | y12, . . . , yN2, µ = vM+1, a, b)
       ...
       Sample vM ∼ p(τM | y1M, . . . , yNM, µ = vM+1, a, b)
       Update the state τ(t) = (v1, . . . , vM)
       Sample vM+1 ∼ p(µ | y, τ = τ(t))
       Update the state µ(t) = vM+1
7. Until t = T
The next step is to figure out how to actually draw samples from these conditional
distributions. One solution would be to do Metropolis-Hastings for each step. However, the
structure of the model allows a Gibbs sampling solution where we can directly sample from
the conditional distribution. Using conjugacy from Bayesian statistics (which we will discuss
in a later section), the conditional distribution for the precision parameter is based on the
following distribution:
τj | y1j , . . . , yNj , µ, a, b ∼ Gamma( a + N/2 , 1 / ( 1/b + Σi=1..N (yij − µ)²/2 ) )        (4.3)
We can also use conjugacy to derive a simple form for the conditional distribution for µ:
µ | y, τ ∼ Normal( Σi=1..N Σj=1..M τj yij / ( N Σj=1..M τj ) , 1 / √( N Σj=1..M τj ) )        (4.4)
Figure 4.4: Illustration of the Gibbs sampler for the consensus model. The top panel shows
the samples drawn for µ. The bottom panel shows the samples drawn for all individual
precision variables.
These sampling equations can be used in the Gibbs sampler for the model. To test the
Gibbs sampler, we sampled synthetic data using the prior predictive distribution. For this
synthetic dataset, we used µ = 150 and set the hyperparameters a and b to 0.3 and 3.33
respectively. We simulated the model with M = 15 individuals and N = 20 repetitions per
individual. Figure 4.4 shows the samples drawn from the posterior distribution over µ and
τ for this synthetic dataset. Note that the first 100 samples were discarded and that the
sampler was run for 400 iterations. Figure 4.5 shows the correlation between the inferred
precision parameters (averaged over the last 300 iterations) and the precision parameters
that were sampled in the prior predictive distribution. As you can observe, the model is able
to (roughly) recover the precisions for each (simulated) individual.
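As a rough sketch, a single iteration of this Gibbs sampler could be coded as below. It assumes the estimates are stored in an N × M matrix y, that mu and tau hold the current state, and that Matlab's shape–scale convention for gamrnd and mean–standard deviation convention for normrnd match the parameterization used in Equations 4.3 and 4.4.

% One Gibbs iteration for the consensus model (sketch of Equations 4.3 and 4.4)
[N, M] = size(y);                              % y is an N x M matrix of estimates
for j = 1:M
    ss     = sum((y(:,j) - mu).^2) / 2;        % half the sum of squared deviations
    tau(j) = gamrnd(a + N/2, 1/(1/b + ss));    % Equation 4.3 (shape, scale)
end
prec = N * sum(tau);                           % N * sum_j tau_j
mn   = sum(y * tau(:)) / prec;                 % precision-weighted mean of all estimates
mu   = normrnd(mn, 1/sqrt(prec));              % Equation 4.4 (mean, standard deviation)

Embedding these lines in a loop over iterations, and storing tau and mu at each step, gives a sampler of the kind used to produce Figure 4.4.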
Exercises
1. Write a Matlab program that implements the Gibbs sampler for the consensus model.
The dataset can be based on the synthetic data you created in an earlier exercise. Use
the parameter and sampler settings as specified above. Create visualizations that are
similar to Figure 4.4 and Figure 4.5.
2. What is the inferred mean when you average the sampled values for the last 300 iterations?
How does this inferred mean compare with the mean calculated by averaging over all
observed data values? When would you expect to see a major difference?
Figure 4.5: Correspondence between the inferred precisions for each (simulated) individual
and the true values that were drawn in the prior predictive distribution
Figure 4.6: Illustration of the data from the consensus model with contaminants. The left
panel shows the synthetic data produced from the model where the cross symbols indicate
contaminants. The right panel shows the same data but with the inferred contaminants at
the 400th iteration from a Gibbs sampler.
xij ∼ Bernoulli(γ)
yij ∼ Norm(µ, 1/√τj )         if xij = 0
yij ∼ Uniform(ymin , ymax )   if xij = 1        (4.5)
τj ∼ Gamma(a, b)
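A rough sketch of ancestral sampling from this mixture model is given below; the parameter values follow the first exercise that follows, and the shape–scale convention of gamrnd is assumed to match the Gamma distribution in Equation 4.5.

% Sketch: one synthetic dataset from the contaminant model (Equation 4.5)
a = 0.3; b = 3.33; mu = 150; M = 15; N = 20;
ymin = 0; ymax = 1000; gam = 0.1;
tau = gamrnd(a, b, 1, M);                          % precision for each individual
x   = rand(N, M) < gam;                            % contaminant indicators x_ij
y   = normrnd(mu, repmat(1./sqrt(tau), N, 1));     % normal route for every entry
u   = ymin + (ymax - ymin) * rand(N, M);           % uniform (contaminant) route
y(x) = u(x);                                       % overwrite contaminated entries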
Exercises
1. Create a synthetic dataset from the consensus model with contaminants. You can
keep the same parameters as before (a = 0.3, b = 3.33, µ = 150, M = 15, N = 20) and
assume that ymin = 0, ymax = 1000, and γ = 0.1. Visualize the data as in the left panel
of Figure 4.6, where the cross marks indicate the estimates associated with contaminants.
2. Consider the problem of posterior inference for the consensus model with contaminants.
It might be helpful to first consider the inference for the latent variables τj and µ.
Suppose we actually tell the inference algorithm what the true values for xij are. How
should we change Equations 4.3 and 4.4?
3. Next we need to know how to calculate the conditional probability p(xij |yij , µ, τj ). Use
Bayes rule to find the two terms that this probability is proportional to. Write out
how this conditional probability can be evaluated when xij = 1 and xij = 0.
4. Write the program that performs the posterior inference over all latent variables τj , µ,
and xij . Use the last iteration of the sampler to visualize the inferred contaminants
similar to Figure 4.6.
** 5. Instead of taking just a single sample from the Gibbs sampler to interpret the data
(as we did in the last exercise), it is much better to average across many samples. For
example, we can calculate the probability that a particular estimate was assigned to the
contaminant route. Calculate this probability and modify the visualization to indicate
this probability for each estimate.
** 6. Consider the basic consensus model again, without contaminants. We have only con-
sidered the Normal distribution for data generation. Although the Normal distribution is
often presented in the literature as a natural choice for modeling real-world data, many
measurements are in fact not Normally distributed. They might be restricted to positive
numbers only or might have heavy tails, leading to extreme measurements.
measurements. Change the model to incorporate different data generating distribu-
tions, such as the Gamma distribution (which is bounded at zero) and the Student-t
distribution (which has heavy tails). With these distributions, we can no longer use
conjugacy to sample from “nice” distributions. However, we can use the Metropolis-
Hastings sampler to sample each individual latent parameter conditional on the state
for all other latent parameters. Implement such a sampler. Once you have reached
this stage, I can give you some real estimation data to test your model on.
Chapter 5
Sequential Monte Carlo
In the preceding chapters, we have investigated inference methods that allow researchers to
apply probabilistic models to data in a posthoc fashion – the modeling can start as soon as
the data collection has finished. This is known as batch processing. In many applications
however, there is no natural end point to the data collection. The data arrive gradually
over time and the goal is to draw inferences about latent variables given the data observed
up to the current point in time. In this case, we could still opt to apply batch processing
methods such as MCMC. For example, in a naive application of MCMC methods, we could
rerun the inference algorithm every time we get a new observation. Although this procedure
leads to proper inferences about the data, it is computationally inefficient. It requires the
explicit storage of all data observed up to this point, which might be impractical if we are dealing
with situations in which large amounts of data need to be processed. More importantly,
processing time increases with every new observation because the batch processing procedure
is applied to all data observed up to this point. For this reason, it is desirable to use inference
methods where the processing time does not depend on time itself – each new observed data
point takes a constant amount of processing time.
Sequential Monte Carlo (SMC) methods are a general class of Monte Carlo sampling
techniques for sequential inference problems. They allow for the online processing of data,
which leads to time and memory requirements that are constant in the total number of ob-
servations received over time. A typical application of SMC methods is tracking problems
(also known as filtering problems) in robotics and vision, where the goal is to estimate the
state of a system that changes over time using a sequence of noisy measurements. For exam-
ple, in object tracking, the goal is to continually track the position of an object on the basis
of noisy measurements about the object’s position. In this case, it is not sufficient to only
use the latest measurement in the inference about the current position. Objects typically
move smoothly over time so all recent measurements are useful to make inferences about the
current position.
Recently, SMC techniques have also been introduced in other areas of cognitive science
including language, decision-making and categorization. For example, Pearl, Goldwater and
Steyvers have recently used SMC methods for online word segmentation problems, where
the problem is to extract words from a continuous stream of speech. Brown and Steyvers
applied SMC methods such as particle filters to model decision-making strategies in non-
stationary environments. Griffiths and Sanborn investigated how particle filters can explain
order effects observed in categorization tasks. Finally, Vul has applied particle filters to
model human performance in an object-tracking task.
In this chapter, we will discuss two SMC methods to estimate the hidden state of a
dynamically changing system: decayed MCMC and particle filters. As the name suggests,
sequential Monte Carlo approaches are based on sampling – the hidden state can be estimated
on the basis of a set of samples that approximate posterior distributions. We will not go into
methods such as Kalman filters and Extended Kalman filters because they make Gaussian
assumptions about state transitions and measurement noise that are often unrealistic. In
addition, the computational overhead to estimate Kalman and Extended Kalman filters
might also be prohibitive in some instances. One desirable feature of SMC approaches such
as particle filters is that the approximation can be made arbitrarily precise by varying the
number of particles. For some low-dimensional problems, very good approximations can
be obtained with a small number of particles (e.g. 50) leading to computationally efficient
procedures.
Exercise
1. To get a good sense of the tracking problem in robotics and vision, watch the YouTube
videos named “Particle filter tracking with adaptive parameters estimation” and “SIR
particle filter using color statistics”. If you want to see a funny tracking application
involving belly dancing, do a search for “Human Motion Tracking - Belly”.
measurements are conditionally independent given the system state. Finally, to complete
the model, we also need to specify a prior p(x0) over states at the start of measurement.
Figure 5.2 shows a synthetic dataset produced by the model over 200 time steps when a = 0.1,
b = 0.8. Note that smiles are more likely in the happy state but can also occasionally occur
in the sad state. Also, notice how the latent state changes 14 times over a period of 200
trials.
Figure 5.2: Sequence of hidden states and observed outcomes produced by an HMM
Exercises
1. Write a Matlab program that generates a synthetic dataset from the HMM model spec-
ified above. Use a = 0.1, b = 0.8 and generate an output sequence for 200 iterations.
Visualize results in a way similar to Figure 5.2. Save the resulting output sequence as
well as hidden state sequence to a file.
2. Repeat the last exercise but use the Matlab function hmmgenerate to create the
output sequence.
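A minimal sketch using hmmgenerate (Statistics Toolbox) is given below. It assumes that a is the probability of switching between the happy and sad states, that b is the probability of the state-congruent observation (a smile when happy, a grimace when sad), and that state 1 = happy, state 2 = sad, symbol 1 = smile, symbol 2 = grimace; check these assumptions against the model specification above.

a = 0.1; b = 0.8;
TRANS = [1-a a;                 % transition matrix: row = current state, column = next state
         a 1-a];
EMIS  = [b 1-b;                 % p(smile|happy) = b, p(grimace|happy) = 1-b
         1-b b];                % p(smile|sad) = 1-b, p(grimace|sad) = b
[obs, states] = hmmgenerate(200, TRANS, EMIS);   % observation and hidden state sequences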
Figure 5.3: Estimated state sequence for example HMM compared with the true state sequence
Exercises
1. Write a Matlab program that estimates the most likely state sequence from a synthetic
dataset produced in the last exercise. Matlab implements the Viterbi algorithm with
the function hmmviterbi.
2. Suppose we measure the accuracy of the estimate by the percentage of cases where
the estimated state matches the true state. What is the accuracy according to this
measure?
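Continuing the hmmgenerate sketch above, the Viterbi estimate and the accuracy measure of the second exercise could be obtained as follows (hmmviterbi is part of the Statistics Toolbox):

likelystates = hmmviterbi(obs, TRANS, EMIS);      % most likely hidden state sequence
accuracy = mean(likelystates == states);          % proportion of correctly recovered states
fprintf('Viterbi accuracy: %5.4f\n', accuracy);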
Exercises
1. In this exercise, we will explore the use of the Viterbi algorithm as a batch processing
approach to the filtering task. Change the Matlab code from the last exercise to
implement the Bayesian filtering task. That is, simulate the case where the data
arrive sequentially over time and at each time t, the Viterbi algorithm is applied to all
data observed up to time t, where t varies from 1 to 200. You can use the estimate of
the last hidden state as an estimate of the filtered state. Note that you will therefore
apply the Viterbi algorithm exactly 200 times. Visualize the filtered estimates over
time by plotting the sequence of estimated hidden states up to t as a function of t
(essentially you are creating a triangular display). Compare this with the true state
sequence.
2. Estimate the time it takes at each iteration of the algorithm. What is (roughly) the
relationship between the size of the dataset t and the time it takes to get a filtered
estimate with the Viterbi algorithm? What is the problem highlighted by this rela-
tionship?
3. Measure the accuracy in the filtering task defined by the percentage of cases where the
filtered estimate matches the true hidden state. Why is the accuracy in this case lower
than in the exercise where the Viterbi algorithm was applied once to the full dataset?
latent state, we can take the mode or mean of this distribution. The SIR particle filter as
described below is desirable for its simplicity. It only requires that one can sample from the
state transition distribution p(xt |xt−1) and can evaluate the likelihood p(yt |xt).
We initialize the particle filter by sampling each particle from the prior distribution
p(x0 ). At each iteration, we then evolve the set of particles from the last generation. First,
we sample for each particle from the last generation a candidate hypothesis for the current
latent state. Therefore, each particle leads to a projection of the hidden state to the current
time step. We then calculate the likelihood of the current observation yt under each particle.
This likelihood serves as the importance weight. We then normalize the importance weights
across particles. In the second step, we evolve the set of particles from the last generation
by sampling, with replacement, from the candidate hypotheses weighted by the importance
weights. Therefore, candidate hypotheses that make the current observations likely lead to
higher importance weights and are more likely to be sampled. This resampling continues
until M new particles are generated. The procedure can be summarized as follows:
1. Set t = 0 and sample each initial particle xi0 ∼ p(x0), for i = 1, . . . , M
2. Repeat
       t = t + 1
       For i = 1:M
           Generate a proposal x∗it ∼ p(xt | xit−1)
           Calculate a weight wti = p(yt | x∗it)
       End
       Normalize the weights, wti = wti / Σi′=1,...,M wti′
       Sample M new particles xit with replacement from {x∗it} with weights wti
3. Until t = T
Note that in the resampling step, the same proposal can be sampled multiple times. Also,
a particular proposal might not be sampled at all. Therefore, each particle can lead to a
varying number of “offspring” depending on the proximity of the particle to the region in state
space that best explains the current observation – bad particles die out and are taken over
by particles that do (at least for the current time step) explain the data.
In this particular version of SIR, the proposal distribution equals the state transition
distribution. This version of SIR is also known as the bootstrap or condensation algorithm.
It is important to realize that this choice of proposal distribution can lead to problems with
high-dimensional cases. This is because in the first step of the SIR approach, we sample
proposals from the state transition distribution without taking the current observation into
account. The current observation is taken into account only in the second step, by weighting
each particle by the likelihood. Therefore, we are blindly proposing how the current hidden
state evolves from the last state and then correcting only afterwards for any unlikely proposals.
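A rough Matlab sketch of this SIR procedure for the two-state HMM example is given below. It reuses the TRANS and EMIS matrices and the observation sequence obs assumed in the earlier HMM sketches, with states coded 1 = happy and 2 = sad, and assumes a uniform prior over the initial state.

M = 50; T = length(obs);
est = zeros(1, T);                             % point estimate of the hidden state
particles = randi(2, 1, M);                    % initialize from a uniform prior p(x0)
for t = 1:T
    prop = zeros(1, M);
    for i = 1:M                                % propose from the transition distribution
        prop(i) = find(rand < cumsum(TRANS(particles(i), :)), 1);
    end
    w = EMIS(prop, obs(t))';                   % importance weights = likelihoods p(yt | x*)
    w = w / sum(w);                            % normalize the weights
    idx = randsample(M, M, true, w);           % resample with replacement by weight
    particles = prop(idx);
    est(t) = mode(particles);                  % mode of the particles at time t
end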
Figure 5.4: Predictions from the Sequential Importance Resampling (SIR) particle filter with
M = 50 particles. The bottom panel shows the distribution over particles at each time step.
For the middle panel, we have taken the mode of the distribution as the point estimate for
the hidden state.
In some cases, this leads to a situation where none of the candidate proposals come near
the regions in the state space that can adequately explain the current observation. In other
versions of SIR particle filters, more complex proposal distributions are used that do take
the current observation into account (in the proposal generation). Also, some authors have
proposed methods such as unscented Kalman filters to deal with this problem.
Figure 5.4 shows the results of the SIR filter applied to the example data. The number of
particles, M was set to 50. The bottom panel shows the distribution over particles for each
iteration. The middle panel shows the mode of the particles to create a point estimate for
the latent state at each time t. Importantly, remember that in the filtering task, a filtering
estimate at time t implies that all data up to and including time t has been observed.
1. Set t = 0 and sample each initial particle xi0 ∼ p(x0), for i = 1, . . . , M
2. Repeat
       t = t + 1
       Set the particle count m = 0
       Repeat
           Sample a random particle i ∼ Uniform(1, . . . , M)
           Sample a proposal x∗ ∼ p(xt | xit−1)
           Sample u ∼ Uniform(0, 1)
           If u < p(yt | x∗), accept the proposal, set m = m + 1, and set xmt = x∗
       Until m = M
3. Until t = T
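For comparison, the inner loop of the direct simulation method for a single time step t could be sketched as follows, using the same variables (t, obs, TRANS, EMIS, M, particles) as in the SIR sketch above:

% Direct simulation: generate M new particles for one time step t (sketch)
m = 0;
newparticles = zeros(1, M);
while m < M
    i  = randi(M);                                        % pick a random particle
    xs = find(rand < cumsum(TRANS(particles(i), :)), 1);  % propose from the transition
    if rand < EMIS(xs, obs(t))                            % accept with probability p(yt | x*)
        m = m + 1;
        newparticles(m) = xs;                             % store the accepted proposal
    end
end
particles = newparticles;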
Exercises
1. Write a Matlab program that implements the SIR particle filter and the direct sim-
ulation method. Apply each particle filter with M = 50 particles to the emotional
states dataset. Use the particle filters to estimate the hidden state at each time t up
to t = 200. You can assume a uniform prior for p(x0 ).