Bayesian Nonparametrics and The Probabilistic Approach To Modelling
Bayesian Nonparametrics and The Probabilistic Approach To Modelling
Bayesian Nonparametrics and The Probabilistic Approach To Modelling
1. Introduction
I will use the term forecast in a very general way to refer to the process of making
any claims about unobserved data on the basis of observed data; I will also
use predict interchangeably with forecast. For all nontrivial phenomena, forecasts
have to include some representation of the forecasting uncertainty. Deterministic
forecasts (e.g. tomorrow’s high temperature will be 17 degrees C) are too brittle
and therefore easy to falsify.
Ideally, a model should also be adaptive. By adaptive I mean that the
forecasts of the model should change depending on the data observed so far. Such
adaptation, or learning, should hopefully have the effect of making the model’s
forecasts be better aligned with actual data. For example, if the forecasts are in
the form of probability distributions, adaptation should have the effect that the
forecast probability assigned to what actually happens should increase after the
model has seen more data, although this cannot be generally guaranteed.
It is clear that all forecasts need to represent uncertainty. Observable data is
generally corrupted by noise in the measurement process, and this noise needs to
be incorporated in the forecast uncertainty. But uncertainty lurks at many other
levels in any model. The amount of noise in the measurement process may itself
be unknown. The model may have a number of parameters which are unknown.
The structure of the model itself may be uncertain, or there may be multiple
plausible competing models for the data. The forecasting system should ideally
produce forecasts that incorporate all reasonable sources of uncertainty. As we
will see, the Bayesian framework provides a natural and coherent approach for
representing and manipulating all forms of uncertainty in modelling.
1
With apologies to probabilists and statisticians my presentation and notation will eschew
formality in favour of didactic clarity.
Ghahramani 3
The expression P (D|m) is variously called the marginal likelihood, model evidence,
or integrated likelihood. The prior over the parameters, P (θ|m), plays the role
of specifying the range as described above (for example it could be uniform on
[−1, +1]) in the form of a distribution over the allowable values of the parameters.
Without a prior our model is not well-defined: we cannot generate or forecast
data until we know how to choose values for θ.2 Once the prior and likelihood
are defined, and only then, is the model m fully specified in the sense that it can
generate possible data sets.
People often object to Bayesian methods on the basis that it forces one to
define priors on the parameters. This, in my opinion, is completely misguided.
2
Of course, our model may have a fixed value of θ, e.g. 7.213, which corresponds to a delta
function prior.
4 Bayesian nonparametric modelling
• given the data, use probability theory to make inferences about any
unknown quantities in your model, or to make predictions from the model.
This process lends itself very naturally to a sequential processing of data.
Your posterior after observing some data Dold , P (θ|Dold , m) is your prior before
observing new data, Dnew :
P (θ|Dnew , Dold , m) ∝ P (Dnew |Dold , θ, m)P (θ|Dold , m).
The sum and product rule also tell us how to make predictions from a model.
Consider predicting some unknown quantity x (e.g. the next data point) given
observed data D and model m:
Z
P (x|D, m) = P (x|θ, D, m)P (θ|D, m)dθ. (2.2)
This is a very intuitively satisfying expression which tells us that our predictions or
forecasts are a weighted average of the forecasts from different parameter values,
weighted by the posterior probability of each parameter value given the data
observed so far. For parametric models, this simplifies since given the parameters
forecasts are independent of the observed data: P (x|θ, D, m) = P (x|θ, m). We will
revisit this point as we discuss parametric vs nonparametric models in Section 3.
If we are considering a number of models m, m0 , etc, then by the sum and product
rules, our forecasts are an average over models weighted by their posteriors.
The probabilistic modelling framework also provides intuitive answers to
problems in model comparison. Assuming a set of competing probabilistic models
M, given some observed data we can evaluate the posterior probability of a
particular model m,
P (D|m)P (m) P (D|m)P (m)
P (m|D) = =P 0 0
.
P (D) m0 ∈M P (D|m )P (m )
Note the prominent role played by the marginal likelihood, P (D|m).
Importantly, this marginal likelihood captures a preference for simpler models
known as Bayesian Occam’s Razor (Jefferys and Berger, 1992; MacKay, 2003;
Rasmussen and Ghahramani, 2001). Consider a set of nested models, for example
different order polynomials (e.g. constant, linear, quadratic, cubic, etc.) used to
fit some regression relationship (Figure 1). Clearly a higher order model such as
the cubic polynomial is strictly more complex than a lower order model such as
the linear polynomial. Model fitting procedures based on optimisation (such as
maximum likelihood methods or penalised likelihood methods) need to take great
care not to overfit the data by fitting the parameters of an overly complex model
to a relatively small data set. Overfitting is not a problem for fully Bayesian
methods, as there is no “fitting” of the model to the data. We only have the
sum rule and the product rule to work with, there is no “optimise” rule in
probability theory. A more complex model, say one that has more parameters,
simply spreads its predictive probability mass over more possible data sets than
a simpler model. If all models are specified as probability distributions over data
sets, since probability distributions have to sum to one, all models have the same
amount of probability mass to spread over possible data (Figure 2). Given a
particular data set, it therefore becomes possible to reject both models that are
too simple or too complex simply by using the rules of probability theory.
6 Bayesian nonparametric modelling
20 20 20 20 0.8
0 0 0 0
0.6
P(Y|M)
−20 −20 −20 −20
0 5 10 0 5 10 0 5 10 0 5 10
M=4 M=5 M=6 M=7
0.4
40 40 40 40
20 20 20 20 0.2
0 0 0 0
0
−20 −20 −20 −20
0 1 2 3 4 5 6 7
0 5 10 0 5 10 0 5 10 0 5 10 M
Figure 1. Marginal likelihoods, Occam’s razor, and overfitting: Consider modelling a function y =
f (x) + describing the relationship between some input variable x, and some output or response
variable y. (left) The red dots in the plots on the left side are a data set of eight (x, y) pairs of
points. There are many possible f that could model this given data. Let’s consider polynomials
of different order, ranging from constant (M = 0), linear (M = 1), quadratic (M = 2), etc to
seventh order (M = 7). The blue curves depict maximum liklihood polynomials fit to the data
under Gaussian noise assumptions (i.e. least squares fits). Clearly the M = 7 polynomial can fit
the data perfectly, but it seems to be overfitting wildly, predicting that the function will shoot
off up or down between neighbouring observed data points. In contrast, the constant polynomial
may be underfitting, in the sense that it might not pick up some of the structure in the data. The
green curves indicate 20 random samples from the Bayesian posterior of polynomials of different
order given this data. A Gaussian prior was used for the coefficients, and an inverse gamma
prior on the noise variance (these conjugate choices mean that the posterior can be analytically
integrated). The samples show that there is considerable posterior uncertainty given the data,
and also that the maximum likelihood estimate can be very different from the typical sample
from the posterior. (right) The normalised model evidence or marginal likelihood for this model is
plotted as a function of the model order, P (Y |M ), where the data set Y are the eight observed
output y values. Note that given the data, model orders ranging from M = 0 to M = 3 have
considerably higher marginal likelihood that other model orders, which seems plausible given
the data. Higher order models, M > 3, have relatively much smaller marginal likelihood which
is not visible on this scale. The decrease in marginal likelihood as a function of model order is
a reflection of the automatic Occam’s razor which results from Bayesian marginalisation.
Ghahramani 7
too simple
D
All possible data sets of size n
Figure 2. An illustration of Occam’s Razor. Consider all possible data sets of some fixed size n.
Competing probabilistic models correspond to alternative distributions over the data sets. Here
we have illustrated three possible models which spread their probability mass in different ways
over these possible data sets. A complex model (shown in blue) spreads its mass over many more
possible data sets, while a simple model (shown in green) concentrates its mass on a smaller
fraction of possible data . Since probabilities have to sum to one, the complex model spreads
its mass at the cost of not being able to model simple data sets as well as a simple model—this
normalisation is what results in an automatic Occam’s razor. Given any particular data set,
here indicated by the dotted line, we can use the marginal likehood to reject both overly simple
models, and overly complex models. This figure is inspired by a figure from MacKay (1991), and
an actual realisation of this figure on a toy classification problem is discussed in Murray and
Ghahramani (2005).
This approach to Bayesian model comparison can be used to solve a vast range
of problems in learning the structure of complex models. For example, it has been
used to learn the number of clusters in a mixture model (Roberts et al., 1998;
Ghahramani and Beal, 2000), finding relevant variables or features in a prediction
problem (MacKay, 1994; Neal, 1998), discovering the number of states in a hidden
Markov model (MacKay, 1997), and learning the dependency structure between
variables in a probabilistic graphical model (Friedman and Koller, 2003).
The above approach to model comparison relies on the ability to enumerate
a set of models M to be compared. This is often reasonable in scientific
settings where there are a number of competing hypotheses to explain some
phenomenon. Bayesian Occam’s razor ensures that overly complex models are
adequately penalised when doing model comparison. However, the complexity
of real-world phenomena often requires us to consider complex models that are
flexible enough to capture the structure in real data. Flexible models are not
only more realistic, but will also generally result in more reasonable forecasts
than simpler models. Bayesian nonparametrics provides a natural framework for
defining flexible models.
3. Nonparametric models
Gaussian processes (GPs) are a distribution over functions which can be used
in numerous contexts where one’s model requires one to represent an unknown
function (Rasmussen and Williams, 2006). One dimensional GPs indexed by
time are familiar to many fields: Brownian motion, Wiener processes, Ornstein-
Uhlenbeck processes, linear Gaussian state-space models, and many random
walks, are all examples of GPs. For historical reasons, GPs are also sometimes
associated with spatial statistics, for example modelling temperature as a function
of spatial location. Within machine learning, GPs are often used for nonlinear
regression and classification.
Consider the following simple model for nonlinear regression
yn = f (xn ) + n .
Here, we wish to model the relationship between an input (or covariate) x and
an output (or response variable) y. The subscript n indexes data points in our
data set D, n is some additive noise, and crucially, f is the unknown regression
function we wish to learn about from data.
7
The word “process” usually evokes the idea of something evolving temporally, but it’s
important to note that stochastic processes can be indexed by time, space, or any other index
space. Many of the uses of stochastic processes in Bayesian nonparametrics do not correspond
to indexing by time.
10 Bayesian nonparametric modelling
Conjugacy and good coverage suggest that the DP could be a very good
general purpose nonparametric density estimator. Unfortunately, the distributions
drawn from a DP prior are, with probability one, discrete so they don’t have a
density. In fact, a draw from a DP prior can be represented in the following way:
∞
X
G= πk δxk (5.1)
k=1
where the sum is an infinite sum, the πk are masses that sum to one, δ is the
Dirac-delta function, and the xk are locations for those probability masses drawn
iid from the base measure H which controls the mean of G. To alleviate the
problem of discreteness, the DP is often used as a prior on the parameters of
a mixture model, with the whole model now called a Dirichlet process mixture
(DPM) (Antoniak, 1974; Ferguson, 1983; Lo, 1984; Neal, 2000).
The particular example of a Dirichlet process mixture of Gaussians, also known
as an infinite Gaussian mixture model (Rasmussen, 2000) is widely used for both
density estimation and clustering. Clustering refers to the problem of finding
groupings of objects or data points such that similar objects belong to the same
group and dissimilar objects belong to different groups. An example application
of clustering is finding groups of similar celestial objects in a large astronomical
survey. Posed abstractly in this way, clustering is not a well-defined problem (how
many groups should we find? what does “similar” mean?). However, if we restrict
ourselves to assuming that each cluster can be captured by some parameterised
probability distribution over data points, such as a Gaussian, then clustering
becomes a well-defined probabilistic inference problem. Bayesian nonparametric
clustering using Dirichlet process mixtures is a natural extension of classical
clustering algorithms such as the EM algorithm for finite mixture models or k-
means (Duda and Hart, 1973). In a finite mixture model, the data is assumed to
come from a distribution composed of K components:
K
X
p(x|π, θ) = πk p(x|θk ) (5.2)
k=1
with mixing weights π that sum to one, and parameters θk for each component.
An infinite mixture model considers the limit of K → ∞, and has the property
that it allows the number of observed clusters to grow with the number of observed
data points. A DPM can be obtained as an infinite limit of a finite mixture model
in many ways, but let us consider the following construction:
xn ∼ p(x|θn )
θn ∼ G
G ∼ DP(α, H).
Because G is discrete, the values of θn will repeat, which results in a clustering
of the data points. By equation (5.1), the πk correspond to the mixing weights
of the infinitely many clusters, which can be compared to the finite counterpart
(equation 5.2). The distribution over partitions of the data points induced by the
DPM is known as a Chinese Restaurant Process (CRP; Aldous (1985)).
12 Bayesian nonparametric modelling
The DPM, apart from being mathematically elegant, has some practical
advantages over traditional clustering algorithms. There can be computational
advantages to running a single instance of a DPM inference algorithm which
automatically infers the number of clusters, rather than having to run multiple
instances of an algorithm to compare different hypotheses on the number of
clusters. Moreover, at test time, the DPM always allows for the possibility that
a new test point (e.g. a new astronomical object) belongs to a cluster that was
not observed in the training data. This makes DPM predictions somewhat more
robust to outliers.
Many processes of interest produce sequential data for which models that assume
exchangeability are inadequate. For example, natural language text, protein amino
acid sequences, financial time series, and natural sounds all have sequential
structure which is important to model.
One of the most widely used models for times series is the hidden Markov
model (HMM). An HMM assumes that a series of observations (x1 , . . . , xT ) was
generated by a process which can be in one of K different discrete states at each
point in time, st ∈ {1, . . . , K}. Moreover, in an HMM, st fully captures the state of
the system at time t in the sense that given st the future evolution of the system
does not depend on the state at times previous to t; this is the Markov property:
P (st+1 |s1 , . . . , st ) = P (st+1 |st ). Finally, the observations are assumed to be drawn
independently given the hidden states, through an emission process P (xt |st ).
An example of an interesting application of HMMs in the physical and
biological sciences is the modelling of single ion channel kinetics (Chung et al.,
1990). Here, the measured time series of currents from an ion channel are modelled
by assuming that at each point in time the channel macromolecule can be in one
of many different conformational states, and that there is a transition matrix
defining the probability of transitioning between each of these states.
Learning an HMM involves inferring both parameters of the transition process
P (st+1 |st ), which is in general a K × K transition matrix, and parameters of the
emission process, P (xt |st ) (Rabiner and Juang, 1986). The problem of learning
the structure of an HMM corresponds to inferring the number of hidden states,
K, from data. Rather than doing model selection over varying K, we would like
to develop a nonparametric approach to HMMs where the model has countably
infinitely many hidden states at its disposal. This can be useful both when we
don’t believe that any finite HMM can capture the data generating process well,
and in situations where we believe that the data was actually generated from
a finite-state process, but we simply don’t know how many states this process
should have.8
8
There is a subtle but important distinction between assuming an infinite model, and
assuming a finite but potentially unbounded model. The difference comes to light when we
consider what we expect to happen in the limit of infinitely large data sets. In the former
case, the posterior will have visited infinitely many distinct states, while in the latter case, the
posterior should converge on some finite number of states.
Ghahramani 13
The key insight that allows one to develop a nonparametric HMM is that
the finite HMM is a time series generalisation of finite mixture models. At each
time step, the HMM assumes that the observation xt was generated by one of
K mixture components, where st indicated the component (or cluster) used. The
only difference between HMMs and mixture models is that the mixture indicators
in an HMM depend on the ones at the previous time step.
Using this insight, Beal et al. (2002) developed the infinite HMM model
(iHMM). The basic idea was to consider a Bayesian HMM with countably
infinitely many states. The main difficulty was to define a sensible prior on the
parameters of the ∞ × ∞ transition matrix. In the finite K dimensional case,
one would typically use independent symmetric Dirichlet prior distributions for
each row of the transition matrix (where a k th row corresponds to the vector
of all outgoing transition probabilities from state k). In the infinite limit, the
independent Dirichlet prior doesn’t result in a sensible model, as under this
prior, with probability one, the HMM will keep transitioning to new states rather
than revisiting previous states. The solution developed in Beal et al (Beal et al.,
2002) was to couple the rows by using a hierarchical Dirichlet process, a solution
analogous to a reinforced urn process in probability theory (Fortini and Petrone,
2012). This work was followed up in the elegant paper by Teh et al (Teh et al.,
2006) which further developed the hierarchical Dirichlet process and proposed an
improved MCMC sampler for the iHMM.
Since the original paper, there have been a number of conceptual and
algorithmic developments of the infinite HMM. The beam sampler provides an
efficient way of sampling the iHMM by using dynamic programming forward-
backward style message passing (Van Gael et al., 2008a).9 Parallel and distributed
implementations of the iHMM allow larger scale deployments (Bratieres et al.,
2010). The block diagonal iHMM is an extension which groups the hidden
states into clusters of states, effectively hierarchically partitioning the state-space
(Stepleton et al., 2009). The infinite HMM can be extended to have a power-law
structure on the hidden states by using the Pitman-Yor process (Van Gael and
Ghahramani, 2011; Blunsom and Cohn, 2011) and has been successfully applied
to diverse problems such as language modelling (Blunsom and Cohn, 2011), and
speaker diarization (Fox et al., 2008).
One limitation of Dirichlet process mixtures, the infinite HMM, and clustering
models in general, is that each data point is modelled as belonging to one of a set
of mutually exclusive clusters. Although the nonparametric variants are flexible,
in that they allow countably infinitely many clusters, they don’t allow a data
point to belong to multiple clusters at the same time – they fundamentally define
distributions over partitions of the data.
We would like building blocks for our models that can allow overlapping
cluster membership. For example, to understand a person’s network of friendships
we really need models which can capture the idea that a person can belong
9
The beam sampler is available in the software package http://mloss.org/software/view/205/
14 Bayesian nonparametric modelling
10
20
30
objects (customers)
40
50
60
70
80
90
100
0 10 20 30 40 50
features (dishes)
Figure 3. A sample from an IBP matrix, with columns reordered. Each row has on average 10
ones. Note the logarithmic growth of nonzero columns with rows. For the “restaurant” analogy
where customers enter a buffet with infinitely many dishes you can refer to the original IBP
papers.
a de Finetti mixing distribution; for the IBP Thibaux and Jordan (Thibaux and
Jordan, 2007) showed that this is the beta process (Hjort, 1990). An important
extension of the IBP is the three-parameter model which exhibits a power law
growth in the number of clusters (Teh and Görür, 2009).
Nonparametric models which use the IBP to define sparse latent variables
have been applied to a number of different problems, as reviewed in Griffiths and
Ghahramani (2011). A very interesting application of the IBP is to the problem of
network modelling: modelling the connections between objects or entities in social,
biological and physical networks (Newman, 2010). While many models of networks
are based on the idea of discovering communities or clusters of the nodes, the IBP
allows each node to belong to multiple overlapping communities or clusters, a
property which can be exploited to obtain improved predictive performance in
tasks such as link prediction (Miller et al., 2009; Mørup et al., 2011; Palla et al.,
2012).
Figure 4 shows a useful visualisation of the relationship between a number
of models. Here we can see that the Dirichlet process mixture (section 5), the
infinite HMM (section 6) and the IBP (section 7) can all be related to each
other. Combining the key features of all models results in the infinite factorial
HMM, a Bayesian nonparametric time series model with factorial hidden structure
(Van Gael et al., 2008b).
16 Bayesian nonparametric modelling
factorial factorial
model HMM
finite
HMM
mixture
IBP ifHMM
factorial
non-param.
Figure 4. A diagram representing how some models relate to each other. We start from finite
mixture models and consider three different ways of extending them. Orange arrows correspond
to time series versions of static (iid) models. Blue arrows correspond to Bayesian nonparametric
versions of finite parameteric models. Green arrows correspond to factorial (overlapping subset)
versions of clustering (non-overlapping) models.
Note that while hierarchical clustering models are often described in terms of
hierarchies over data points, in the above examples we have seen that hierarchies
can be useful more generally over internal components of models, such as hidden
states of an HMM, or parameters of a model.
We have seen that the Dirichlet process mixture model (and the associated
Chinese Restaurant Process) can be used to define Bayesian nonparametric models
for flat clustering. Are there equivalent Bayesian nonparametric models which
result in hierarchical clusterings?
The answer is yes. Here we very briefly touch upon two frameworks for
generating hierarchies which can be used in nonparametric models. For the
nonparametric setting, the key requirement is that the models define an infinitely
exchangeable distribution over the data points. For this to hold, the models must
be projective in the sense that marginalising over the (N + 1)th point should result
in a coherent model over N points (Orbanz, 2009).
The first solution is given by Kingman’s coalescent (Kingman, 1982). This
model has been widely used for modelling population genetics (Donnelly and
Tavare, 1995) and more recently in the machine learning community for
hierarchical clustering (Teh et al., 2008). The coalescent defines distributions over
trees by starting with every data point in its own cluster and considering a process
that merges clusters backwards in time until only one cluster remains. The key
property of the coalescent is that for each pair of clusters the event that they
merge is independent of the event that any other pair merges, and the time to
this event is drawn from an exponential distribution with constant rate (which we
can set to 1 without loss of generality). This process is well-defined in the limit
of N → ∞ data points, and defines an infinitely exchangeable distribution over
points.
A second solution is given by the Dirichlet Diffusion Tree (DDT; Neal
(2003)). Like Kingman’s coalescent, the DDT also defines a tree by considering
a Markovian process evolving in time; however, the DDT starts at time t = 0
and evolves forward for 1 unit of time.11 We denote the evolution of the nth
point in time via xn (t). The first data point starts at a location x1 (0) = 0 and
follows a Brownian motion process, ending up at some point drawn marginally
from a unit variance Gaussian, x1 (1) ∼ N (0, 1). The second data point exactly
follows the path of the first data point until a divergence event occurs, after
which point its path is independent of the path of the first point. The time
to divergence is parameterised through a hazard function, an object commonly
used in survival analysis. Subsequent data points follow the paths of previous
data points, diverging according to a scaled form of the hazard function, and
when reaching a branch point choosing a branch with probability proportional
to the number of points that chose that branch before. This process defines
an exchangeable distribution over data points, parameterised by the unknown
tree. Using the DDT prior, the problem of hierarchical clustering becomes one of
inferring the unknown tree given some data.
11
Note that both in the coalescent and in the DDT “time” is an indexing variable used to
generate a tree structure, and need not correspond to a real notion of time.
18 Bayesian nonparametric modelling
Both Kingman’s coalescent and the Dirichlet diffusion tree generate binary
trees with probability one.12 A generalisation of the DDT that allows arbitrary
branching of the trees is given by the Pitman-Yor diffusion tree (PYDT; (Knowles
and Ghahramani, 2011). The process is generalised to allow, at each branch point,
for the new data point either to follow the path of one of the previous points, or
to create a new branch. Like the Pitman Yor process (Pitman and Yor, 1997) the
PYDT has two parameters controlling its branching structure. Certain settings of
these parameters result in the DDT, while other settings recover the distribution
over trees induced by Kingman’s coalescent. The tree distribution induced by
the PYDT is the multifurcating Gibbs fragmentation tree (McCullagh et al.,
2008), the most general Markovian exchangeable distribution over trees. General
distributions over hierarchies and other clustering structures can also be derived
through the elegant theory of fragmentation-coagulation processes (Teh et al.,
2011).
12
In the case of the DDT, binary trees result in all but pathological cases where the hazard
function diverges before t = 1.
Ghahramani 19
10. Inference
13
Most theoretical results on intractability focus on the worst case of a problem instance.
The real-world inference problem may in fact be much easier to approximate.
20 Bayesian nonparametric modelling
see Neal (1993)), exact sampling methods (Propp and Wilson, 1996) and
particle filtering methods (Doucet et al., 2000). Some examples of deterministic
algorithms include the Laplace approximation, variational methods (Jordan
et al., 1999), and expectation propagation (Minka, 2001). Both deterministic and
stochastic algorithms for inference can often exploit the conditional independence
relationships that exist between the variables in a model to perform the relevant
computations efficiently using local messages passed between nodes of a graphical
model (Winn and Bishop, 2005; Minka, 2005; Koller and Friedman, 2009).
A complete review of approximate inference methods is beyond the scope of
this paper, but a couple of points are worth making. All approximate inference
methods can be characterised in terms of a speed-accuracy tradeoff. Some methods
are fast but often inaccurate, while other methods are slow or, like MCMC, can
be run for increasing amounts of time to give increasingly accurate results. There
is no general rule of thumb for which approximate inference method is best—
different models and problems tend to favour different methods. However, for a
particular problem, the difference between a good choice of inference algorithm
and a poor choice can be orders of magnitude of computation. Thus, it is well
worth being familiar with a number of inference algorithms and being willing to
try several methods on a particular problem.
The field of Bayesian statistics has thrived in recent years, both due to the
availability of better inference algorithms and the dramatic growth in computing
power. Ironically, the most widely used inference method in Bayesian statistics
is MCMC, which itself is a thoroughly frequentist (non-Bayesian) method for
approximating integrals. From a Bayesian perspective, numerical computation
problems are also inference problems. In numerical integration, one is trying to
infer the value of an integral by computing the integrand at a limited number of
locations. The values of the integrand are the data, which combined with a prior
on the integrand, can result in a posterior on the value of the integral. The choice
of where to evaluate the integrand can be made using Bayesian decision theory,
clearly the random evaluations of MCMC are not an optimally efficient method
for evaluating integrals. The Bayesian approach to numerical integration is known
variously as Bayesian quadrature or Bayesian Monte Carlo, and although it is not
as widely used as MCMC it can be dramatically more efficient in situations where
evaluating the integrand is computationally costly (Diaconis, 1988; O’Hagan,
1991; Rasmussen and Ghahramani, 2003; Osborne et al., 2012).
Deriving the approximate inference equations for each new model can be
tedious and error-prone, thereby inhibiting the researcher’s ability to explore
many variations on any given model. The field of probabilistic programming
offers an exciting alternative approach to building and evaluating models. Three
notable examples of probabilistic programming frameworks are BUGS, Church,
and Infer.NET (Lunn et al., 2000; Goodman et al., 2008; Minka et al., 2010). The
basic idea is to write down a computer program defining the generative model
in a programming language that is augmented to have random variables. This
computer program that defines the generative model can then be automatically
transformed or manipulated into a program that performs approximate posterior
inference in the model. Thus the process of deriving the approximate inference
equations is automated, reducing the risk of human error. One potential
disadvantage of this approach is that the automatically derived inference code
Ghahramani 21
11. Conclusions
References
Adams, R. P., Wallach, H., and Ghahramani, Z. (2010). Learning the structure
of deep sparse graphical models. In Teh, Y. W. and Titterington, M., editors,
13th International Conference on Artificial Intelligence and Statistics, pages
1–8, Chia Laguna, Sardinia, Italy.
Aldous, D. (1985). Exchangeability and related topics. In École d’été de
probabilités de Saint-Flour XIII–1983, pages 1–198. Springer, Berlin.
Antoniak, C. E. (1974). Mixtures of Dirichlet processes with applications to
Bayesian nonparametric problems. Annals of Statistics, 2(6):1152–1174.
Asai, M., McAleer, M., and Yu, J. (2006). Multivariate stochastic volatility: a
review. Econometric Reviews, 25(2):145–175.
Beal, M. J., Ghahramani, Z., and Rasmussen, C. E. (2002). The infinite hidden
Markov model. Advances in Neural Information Processing Systems, 14:577 –
584.
Blunsom, P. and Cohn, T. (2011). A hierarchical Pitman-Yor process
HMM for unsupervised part of speech induction. In Proceedings of the
49th Annual Meeting of the Association for Computational Linguistics:
Human Language Technologies-Volume 1, pages 865–874. Association for
Computational Linguistics.
Bratieres, S., Van Gael, J., Vlachos, A., and Ghahramani, Z. (2010). Scaling
the iHMM: Parallelization versus Hadoop. In Proceedings of the International
Workshop on Scalable Machine Learning and Applications, Bradford, UK. IEEE
International Conference on Computing and Information Technology.
Bru, M. (1991). Wishart processes. Journal of Theoretical Probability, 4(4):725–
751.
Chung, S. H., Moore, J. B., Xia, L., Premkumar, L. S., and Gage, P. W. (1990).
Characterization of single channel currents using digital signal processing
techniques based on hidden markov models. Phil. Trans. R. Soc. Lond. B,
329(1254):265–285.
Diaconis, P. (1988). Bayesian numerical analysis. In Statistical Decision Theory
and Related Topics IV, volume 1, pages 163âĂŞ–175. Springer-Verlag.
Donnelly, P. and Tavare, S. (1995). Coalescents and genealogical structure under
neutrality. Annual Review of Genetics, 29(1):401–421.
Ghahramani 23
Goodman, N., Mansinghka, V., Roy, D., Bonawitz, K., and Tenenbaum, J.
(2008). Church: a language for generative models. In Uncertainty in Artificial
Intelligence, volume 22, page 23.
Gouriéroux, C., Jasiak, J., and Sufana, R. (2009). The Wishart autoregressive
process of multivariate stochastic volatility. Journal of Econometrics,
150(2):167–181.