An Introduction To Probabilistic Programming: Jan-Willem Van de Meent
Contents

Abstract
Acknowledgements
1 Introduction
  1.1 Model-based Reasoning
  1.2 Probabilistic Programming
  1.3 Example Applications
  1.4 A First Probabilistic Program
3 Graph-Based Inference
  3.1 Compilation to a Graphical Model
  3.2 Evaluating the Density
  3.3 Gibbs Sampling
  3.4 Hamiltonian Monte Carlo
  3.5 Compilation to a Factor Graph
  3.6 Expectation Propagation
8 Conclusion
References
Acknowledgements
We would like to thank the very large number of people who have read
through preliminary versions of this manuscript. Comments from the
reviewers have been particularly helpful, as have general interactions
with David Blei and Kevin Murphy in particular. Some people we would
like to thank individually are, in no particular order, Tobias Kohn, Rob
Zinkov, Marcin Szymczak, Gunes Baydin, Andrew Warrington, Yuan
Zhou, and Celeste Hollenbeck, as well as numerous other members of
Frank Wood's Oxford and UBC research groups who graciously answered
the call to comment and contribute.
We would also like to acknowledge colleagues who have contributed
intellectually to our thinking about probabilistic programming. First
among these is David Tolpin, whose work with us at Oxford decisively
shaped the design of the Anglican probabilistic programming language,
and forms the basis for the material in Chapter 6. We would also like
to thank Josh Tenenbaum, Dan Roy, Vikash Mansinghka, and Noah
Goodman for inspiration, periodic but important research interactions,
and friendly competition over the years. Chris Heunen, Ohad Kammar
and Sam Staton helped us to understand subtle issues about the se-
mantics of higher-order probabilistic programming languages. Lastly
we would like to thank Mike Jordan for asking us to do this, providing
the impetus to collate everything we thought we had learned while
putting together a NIPS tutorial years ago.
Notation
E = (* v v)
    An expression literal.
E′ = E[v := c] = (* c c)
    An expression in which a constant c replaces the variable v.
free-vars(e)
    The free variables in an expression.
A = {(u_1, v_1), ..., (u_|A|, v_|A|)}
    The directed edges (u_i, v_i) between parents u_i ∈ V and children v_i ∈ V.
P = [v_1 ↦ E_1, ..., v_|V| ↦ E_|V|]
    The probability mass or density for each variable v_i, represented as a target language expression P(v_i) = E_i.
Y = [y_1 ↦ c_1, ..., y_|Y| ↦ c_|Y|]
    The observed values Y(y_i) = c_i.
1 Introduction
by the real world process then there is evidence that the theory can be
falsified. This describes science to a large extent. Good theories take the
form of models that can be used to make testable predictions. We can
test those predictions and falsify model variants that fail to replicate
observed statistics.
Models can also be used to make decisions. For instance, when
playing a game you either consciously or unconsciously use a model of
how your opponent will play. To use such a model to decide on your
own next move, you simulate taking a number of different actions,
then pick one amongst them by simulating your opponent's reaction
according to your model of them, and so forth, until reaching a game
state whose value you know, for instance the end of the game. Choosing
the action that maximizes your chance of winning is a rational strategy
that can be framed as model-based reasoning. Abstracting this, treating
life as a game whose score you attempt to maximize requires a model of
the entire world, including your own physical self; this is where
model-based probabilistic machine learning meets artificial intelligence.
A useful model can take a number of forms. One kind takes the form
of a reusable, interpretable abstraction, with a good associated inference
algorithm, that describes summary statistics or features extracted
from raw observable data. Another kind consists of a reusable but non-
interpretable and entirely abstract model that can accurately generate
complex observable data. Yet another kind, notably models in
science and engineering, takes the form of a problem-specific simulator
that describes a generative process in precise, engineering-like terms.
Over the course of this introduction it will become apparent how
probabilistic programming addresses this complete spectrum.
All model types have parameters. Fitting these parameters, when
few, can sometimes be performed manually, by intensive theory-based
reasoning and a priori experimentation (the masses of particles in
the standard model), by measuring conditional subcomponents of a
simulator (the compressive strength of various concrete types and their
action under load), or by simply fiddling with parameters to see which
values produce the most realistic outputs.
x ∼ Beta(α, β)
y ∼ Bernoulli(x) (1.1)
X                                    Y
scene description                    image
simulation                           simulator output
program source code                  program return value
policy prior and world simulator     rewards
cognitive decision making process    observed behavior
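As a concrete instance of such a latent/observed pair, the following Python snippet (our addition, not part of the original text; numpy is assumed, and the hyperparameter values are arbitrary) forward-samples the model in Equation (1.1):

import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 2.0, 2.0

x = rng.beta(alpha, beta)    # latent coin bias, x ~ Beta(alpha, beta)
y = rng.binomial(1, x)       # observed flip, y ~ Bernoulli(x)
print(x, y)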
hard, particularly if you think about writing a simulator that only needs
to stochastically generate reasonably plausible scene graphs. Noting
that P(X, Y) = P(Y | X)P(X), all we need is a way to go from a
scene graph to an observable image, and we have a complete description of
a joint distribution. There are many kinds of renderers that do just this.
Although they are generally deterministic, they are perfectly fine to use
when specifying a joint distribution, because they map from some latent
scene description to observable pixel space; with the addition of
some image-level pixel noise, reflecting, for instance, sensor imperfections
or Monte Carlo ray-tracing artifacts, they form a perfectly valid likelihood.
An example of this “vision as inverse graphics” idea (Kulkarni
et al., 2015b) appearing first in Mansinghka et al. (2013) and then
subsequently in Le et al. (2017b,a) took the image Y to be a Captcha
image and the scene description X to include the obscured string. In
all three papers the point was not Captcha-breaking per se but instead
demonstrating both that such a model is denotable in a probabilistic
programming language and that such a model can be solved by general
purpose inference.
Let us momentarily consider alternative ways to solve such a “Captcha
problem.” A non-probabilistic programming approach would require
gathering a very large number of Captchas, hand-labeling them all,
then designing and training a neural network to regress from the image
to a text string (Bursztein et al., 2014). The probabilistic program-
ming approach in contrast merely requires one to write a program that
generates Captchas that are stylistically similar to the Captcha family
one would like to break – a model of Captchas – in a probabilistic
programming language. Conditioning such a model on its observable
output, the Captcha image, will yield a posterior distribution over text
strings. This kind of conditioning is what probabilistic programming
evaluators do.
Figure 1.1 shows a representation of the output of such a conditioning
computation. Each Captcha/bar-plot pair consists of a held-out Captcha
image and a truncated marginal posterior distribution over unique string
interpretations. Drawing your attention to the middle of the bottom row,
notice that the noise on the Captcha makes it more-or-less impossible to
tell if the string is “aG8BPY” or “aG8RPY.” The posterior distribution
1.1. Model-based Reasoning 17
Fig. 4. Posteriors of real Facebook and Wikipedia Captchas. Conditioning on each Captcha, we show an approximate posterior produced by a set of weighted importance sampling particles {(w^m, x^(m))}_{m=1}^{M=100}.
the symbol x and encode it using a real positive number between 0 and
1 inclusive, i.e. x ∈ R ∩ [0, 1]. Then using standard definitions for the
distributions indicated by the joint denotation in Equation (1.1) we can
write
p(x, y) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} x^y (1 - x)^{1-y} x^{\alpha-1} (1 - x)^{\beta-1}    (1.3)
and then use rules of algebra to simplify this expression to
p(x, y) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{y+\alpha-1} (1 - x)^{\beta-y}.    (1.4)
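As a quick sanity check of this simplification, the following Python snippet (our addition, not part of the original text) verifies numerically that the densities in Equations (1.3) and (1.4) agree at an arbitrary point:

from math import gamma

def p_13(x, y, a, b):
    # pre-simplified density, Equation (1.3)
    return gamma(a + b) / (gamma(a) * gamma(b)) \
        * x**y * (1 - x)**(1 - y) * x**(a - 1) * (1 - x)**(b - 1)

def p_14(x, y, a, b):
    # simplified density, Equation (1.4)
    return gamma(a + b) / (gamma(a) * gamma(b)) \
        * x**(y + a - 1) * (1 - x)**(b - y)

assert abs(p_13(0.3, 1, 2.0, 2.0) - p_14(0.3, 1, 2.0, 2.0)) < 1e-12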
Note that we have been extremely pedantic here, using words like
“symbol,” “denotes,” “encodes,” and so forth, to try to get you, the
reader, to think in advance about other ways one might denote such
a model and to realize if you don’t already that there is a fundamen-
tal difference between the symbol or expression used to represent or
denote a meaning and the meaning itself. Where we haven’t been pedan-
tic here is probably the most interesting thing to think about: What
does it mean to use rules of algebra to manipulate Equation (1.3) into
Equation (1.4)? To most reasonably trained mathematicians, applying
expression-transforming rules that obey the laws of associativity,
commutativity, and the like is natural, and is performed almost un-
consciously. To a reasonably trained programming languages person
these manipulations are meta-programs, i.e. programs that consume and
output programs, that perform semantics-preserving transformations on
expressions. Some probabilistic programming systems operate in exactly
this way (Narayanan et al., 2016). What we mean by semantics preserv-
ing in general is that, after evaluation, expressions in pre-simplified and
post-simplified form have the same meaning; in other words, evaluate
to the same object, usually mathematical, in an underlying formal
language whose meaning is well established and agreed. In probabilistic
programming semantics preserving generally means that the mathe-
matical objects denoted correspond to the same distribution (Staton
et al., 2016). Here, after algebraic manipulation, we can agree that,
when evaluated on inputs x and y, the expressions in Equations (1.3)
and (1.4) would evaluate to the same value and thus are semantically
equivalent alternative denotations. In Chapter 7 we return to these semantic considerations.
1.1.3 Query
Either way, having such a handle on the resulting posterior distribution,
density function or method for drawing samples from it, allows us to
ask questions, “queries” in general. These are best expressed in integral
form as well. For instance, we could ask: what is the probability that
the bias of the coin is greater than 0.7, given that the coin came up
heads? This is mathematically denoted as
p(x > 0.7 | y = 1) = \int \mathbb{I}(x > 0.7) \, p(x | y = 1) \, dx    (1.10)
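For the Beta-Bernoulli model this query has a closed form, because the posterior is conjugate: x | y = 1 ∼ Beta(α + 1, β). A short sketch using SciPy (our addition; the hyperparameter values are illustrative):

from scipy.stats import beta as beta_dist

alpha, b = 2.0, 2.0
posterior = beta_dist(alpha + 1, b)   # conjugate update after observing y = 1
print(posterior.sf(0.7))              # p(x > 0.7 | y = 1), via the survival function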
[Figure 1.2: the computer science pipeline, from a program and its inputs to an output, shown alongside the statistical pipeline, from observations y via inference back to the unknown inputs.]
tions from machine learning using the inference algorithms and theory
from statistics.
Probabilistic programming is about doing statistics using the tools
of computer science. Computer science, both the theoretical and en-
gineering discipline, has largely been about finding ways to efficiently
evaluate programs, given parameter or argument values, to produce
some output. In Figure 1.2 we show the typical computer science pro-
gramming pipeline on the left hand side: write a program, specify the
values of its arguments or situate it in an evaluation environment in
which all free variables can be bound, then evaluate the program to
produce an output. The right hand side illustrates the approach taken
to modeling in statistics: start with the output, the observations or data
Y , then specify a usually abstract generative model p(X, Y ), often de-
noted mathematically, and finally use algebra and inference techniques
to characterize the posterior distribution, p(X | Y ), of the unknown
quantities in the model given the observed quantities. Probabilistic
programming is about performing Bayesian inference using the tools of
computer science: programming languages for model denotation, and
statistical inference algorithms for computing the conditional distribution
of program inputs that could have given rise to the observed program
output.
implementors will take. That said, the real substance of this tutorial is
entirely language agnostic, and the main points should be understood
in this light.
We have left off the preceding extensive list of languages both
one important class of language, probabilistic logic languages (Kimmig
et al., 2011; Sato and Kameya, 1997), and sophisticated, useful,
and widely deployed libraries/embedded domain-specific languages for
modeling and inference (Infer.NET (Minka et al., 2010a), Factorie (McCallum
et al., 2009), Edward (Tran et al., 2017), PyMC3 (Salvatier
et al., 2016)). One link between the material presented in this
tutorial and these additional languages and libraries is that the inference
methodologies we will discuss apply to advanced forms of probabilistic
logic programs (Alberti et al., 2016; Kimmig and De Raedt, 2017) and,
in general, to the graph representations constructed by such libraries.
In fact the libraries can be thought of as compilation targets for ap-
propriately restricted languages. In the latter case strong arguments
can be made that these are also languages in the sense that there is
an (implicit) grammar, a set of domain-specific values, and a library
of primitives that can be applied to these values. The more essential
distinction is the one we have structured this tutorial around, that being
the difference between static languages in which the denoted model can
be compiled to a finite-node graphical model and dynamic languages in
which no such compilation can be performed.
Constrained Simulation
Program Induction
systems. The latter, in particular, has been used to study the mutually-
recursive reasoning among multiple agents. A number of examples
on this are detailed in an excellent online tutorial (Goodman and
Stuhlmüller, 2014).
2 A Probabilistic Programming Language Without Recursion

In this and the next two chapters of this tutorial we will present the key
ideas of probabilistic programming using a carefully designed first-order
probabilistic programming language (FOPPL). The FOPPL includes
most common features of programming languages, such as conditional
statements (e.g. if), primitive operations (e.g. +, -, etc.), and user-
defined functions. The restrictions that we impose are that functions
must be first order, which is to say that functions cannot accept other
functions as arguments, and that they cannot be recursive.
These two restrictions result in a language where models describe
distributions over a finite number of random variables. In terms of
expressivity, this places the FOPPL on even footing with many existing
languages and libraries for automating inference in graphical models with
finite graphs. As we will see in Chapter 3, we can compile any program
in the FOPPL to a data structure that represents the corresponding
graphical model. This turns out to be a very useful property when
reasoning about inference, since it allows us to make use of existing
theories and algorithms for inference in graphical models.
A corollary to this characteristic is that the computation graph of
any FOPPL program can be completely determined in advance.

2.1 Syntax
v ::= variable
c ::= constant value or primitive operation
f ::= procedure
e ::= c | v | (let [v e1] e2) | (if e1 e2 e3)
    | (f e1 ... en) | (c e1 ... en)
    | (sample e) | (observe e1 e2)
q ::= e | (defn f [v1 ... vn] e) q

Language 2.1: First-order probabilistic programming language (FOPPL)
The remaining two forms, (sample e) and (observe e1 e2), are what
make the FOPPL a probabilistic programming language: sample
introduces random choices, while observe conditions these choices on
observed values.
Some things to note about this language are that it is simple: the
grammar has only a small number of special forms. It also has no
input/output functionality, which means that all data must be inlined in
the form of an expression. However, despite this relative simplicity, we
will see that we can express any graphical model as a FOPPL program.
At the same time, the relatively small number of expression forms makes
it much easier to reason about implementations of compilation and
evaluation strategies.
Relative to other Lisp dialects, arguably the most critical charac-
teristic of the FOPPL is that, provided all primitives halt on all
possible inputs, potentially non-halting computations are disallowed; in
fact, there is a finite upper bound on the number of computation steps,
and this upper bound can be determined at compile time. This
design choice has several consequences. The first is that all data needs to
be inlined so that the number of data points is known at compile time.
A second consequence is that the FOPPL grammar precludes higher-order
functions, which is to say that user-defined functions cannot accept
other functions as arguments. The reason for this is that a reference
to a user-defined function f is in itself not a valid expression type. Since
arguments to a function call must be expressions, this means that we
cannot pass a function f′ as an argument to another function f.
Finally, the FOPPL does not allow recursive function calls, although
the syntax does not forbid them. This restriction can be enforced via
the scoping rules in the language. In a program q of the form

(defn f1 ...) (defn f2 ...) e

we can call f1 inside of f2, but not vice versa, since f2 is defined after
f1. Similarly, we impose the restriction that we cannot call f1 inside
f1, which we can intuitively think of as f1 not having been defined yet.
Enforcing this restriction can be done using a pre-processing step.
A second distinction between the FOPPL and other Lisp dialects is
that we make use of vector and map data structures, analogous to
the ones provided by Clojure.
The fact that the FOPPL only provides a small number of expression
types is a big advantage when building a probabilistic programming
system. We will see this in Chapter 3, where we will define a translation
from any FOPPL program to a Bayesian network using only 8 rules
(one for each expression type). At the same time, for the purposes of
writing probabilistic programs, having a small number of expression
types is not always convenient. For this reason we will provide a number
of alternate expression forms, which are referred to as syntactic sugar,
to aid readability and ease of use.
We have already seen two very simple forms of syntactic sugar:
[. . .] is a sugared form of (vector . . .) and {. . .} is a sugared form
for (hash-map . . .). In general, each sugared expression form can be
desugared, which is to say that it can be reduced to an expression in the
grammar in Language 2.1. This desugaring is done as a preprocessing
step, often implemented as a macro rewrite rule that expands each
sugared expression into the equivalent desugared form.
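To illustrate, here is a toy desugaring pass in Python (our own sketch, not the book's implementation). Expressions are represented as nested tuples; Python lists stand in for the [...] vector sugar and dicts for the {...} hash-map sugar:

def desugar(e):
    if isinstance(e, list):                        # [...] sugar
        return ('vector',) + tuple(desugar(x) for x in e)
    if isinstance(e, dict):                        # {...} sugar
        kv = []
        for k, v in e.items():
            kv += [desugar(k), desugar(v)]
        return ('hash-map',) + tuple(kv)
    if isinstance(e, tuple):                       # compound forms
        return tuple(desugar(x) for x in e)
    return e                                       # constants and variables

print(desugar(('let', ('y', [1, 2, 3]), 'y')))
# -> ('let', ('y', ('vector', 1, 2, 3)), 'y')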
Note that this syntax looks very similar to that of the let form. However,
whereas let binds each variable to a single value, the foreach form
associates each variable vi with a sequence ei and then maps over the
values in these sequences for a total of c steps, returning a vector of results.
If the length of any of the bound sequences is less than c, then the foreach
form will result in a runtime error.
With the foreach form, we can rewrite Program 2.2 without having
to make use of the helper function observe-data
(let [y-values [2.1 3.9 5.3 7.7 10.2]
      slope (sample (normal 0.0 10.0))
      intercept (sample (normal 0.0 10.0))]
  (foreach 5
    [x (range 1 6)
     y y-values]
    (let [fx (+ (* slope x) intercept)]
      (observe (normal fx 1.0) y))))
(let [v2 (f 2 v1 a1 ... an)]
  ...
  (let [v_{c-1} (f (- c 1) v_{c-2} a1 ... an)]
    v_{c-1}) ...)))

where v_0, ..., v_{c-1} and a_1, ..., a_n are fresh variables. Note that the loop
sugar computes an iteration over a fixed set of indices.
To illustrate how the loop form differs from the foreach form, we
show a new variant of the linear regression example in Program 2.3. In
this version of the program, we not only observe a sequence of values y_n
according to a normal centered at f(x_n), but we also compute the sum
of squared residuals r^2 = \sum_{n=1}^{5} (y_n - f(x_n))^2. To do this, we define
a helper function that accumulates the squared residuals.
In summary, the difference between loop and foreach is that loop
can be used to accumulate a result over the course of the iterations. This
is useful when you want to compute some form of sufficient statistics,
filter a list of values, or really perform any sort of computation that
iteratively builds up a data structure. The foreach form provides a much
more specific loop type that evaluates a single expression repeatedly
with different values for its variables. From a statistical point of view,
we can think of loop as defining a sequence of dependent variables,
whereas foreach creates conditionally independent variables.
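In Python terms (a rough analogy of our own, not the book's), foreach behaves like a map over index-aligned sequences, while loop threads an accumulator through the iterations, like functools.reduce:

from functools import reduce

xs, ys = [1, 2, 3, 4, 5], [2.1, 3.9, 5.3, 7.7, 10.2]
f = lambda x: 2.0 * x + 0.1                      # a fixed regression line

# foreach: one independent evaluation per index
residuals = [y - f(x) for x, y in zip(xs, ys)]

# loop: each step sees the result of the previous step
r2 = reduce(lambda acc, r: acc + r**2, residuals, 0.0)
print(r2)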
2.3 Examples
b_0 (foreach 10 []
      (foreach 1 [] (sample weight-prior)))
b_1 (foreach 10 []
      (foreach 1 [] (sample weight-prior)))
b_2 (foreach 1 []
      (foreach 1 [] (sample weight-prior)))
# data
list(t = c(94.3, 15.7, 62.9, 126, 5.24,
           31.4, 1.05, 1.05, 2.1, 10.5),
     y = c(5, 1, 5, 14, 3, 19, 1, 1, 4, 22),
     N = 10)

# inits
list(a = 1, b = 1)

# model
{
  for (i in 1:N) {
    theta[i] ~ dgamma(a, b)
    l[i] <- theta[i] * t[i]
    y[i] ~ dpois(l[i])
  }
  a ~ dexp(1)
  b ~ dgamma(0.1, 1.0)
}
Program 2.7: The Pumps example model from BUGS (OpenBugs, 2009).
(defn data []
  [[94.3 15.7 62.9 126 5.24 31.4 1.05 1.05 2.1 10.5]
   [5 1 5 14 3 19 1 1 4 22]
   [10]])
One advantage of writing this text using the FOPPL rather than an existing
language like BUGS is that FOPPL programs are comparatively easy to
reason about and manipulate, since there are only 8 expression forms
in the language. In the next chapter we will exploit this in order to
mathematically define a translation from FOPPL programs to Bayesian
networks and factor graphs, keeping in mind that all the basic concepts
that we will employ also apply to other probabilistic programming
systems, such as BUGS.
2.4 A Simple Purely Deterministic Language

Note that neither sample nor observe statements appear in the syntax,
and that procedure calls are allowed only for primitive operations, not
for defined procedures. Having these constraints ensures that expressions
E cannot depend on any probabilistic choices or conditioning.
3 Graph-Based Inference

3.1 Compilation to a Graphical Model

ρ, φ, e ⇓ G, E    (3.1)
V = {z, y},
A = {(z, y)},
P = [z ↦ (pbern z 0.5),
     y ↦ (pnorm y (if (= z 0) -1.0 1.0) 1.0)],
Y = [y ↦ 0.5],
E = z
The vertex set V of the net G contains two variables, whereas the arc
set A contains a single pair (z, y) to mark the conditional dependence
relationship between these two variables. In the map P, the probability
mass for z is defined as the target language expression (pbern z 0.5).
Here pbern refers to a function in the target language that implements
the probability mass function for the Bernoulli distribution. Similarly, the
density for y is defined using pnorm, which implements the probability
density function for the normal distribution. Note that the expression
for the program variable mu has been substituted into the density for
y. Finally, the map Y contains a single entry that holds the observed
value for y.
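One possible in-memory representation of the tuple (V, A, P, Y) is sketched below in Python (the field names are our own choice; expressions are stored as strings for readability):

from dataclasses import dataclass, field

@dataclass
class Graph:
    V: set = field(default_factory=set)       # vertices (random variables)
    A: set = field(default_factory=set)       # directed edges (parent, child)
    P: dict = field(default_factory=dict)     # density expression per variable
    Y: dict = field(default_factory=dict)     # observed values

G = Graph(
    V={'z', 'y'},
    A={('z', 'y')},
    P={'z': '(pbern z 0.5)',
       'y': '(pnorm y (if (= z 0) -1.0 1.0) 1.0)'},
    Y={'y': 0.5},
)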
  (observe (if (= z 0) d0 d1) y)
  z)
Program 3.3: One-point mixture with explicit parameters
= ∞.
So what is going on here? This integral does not converge because
we have not assumed the correct support: we cannot marginalize over
\int_{\mathbb{R}} d\mu_0 \, p(\mu_0 | z = 0) and \int_{\mathbb{R}} d\mu_1 \, p(\mu_1 | z = 1)
if we assume p(\mu_0 | z = 0) = 1 and p(\mu_1 | z = 1) = 1. These uniform
densities effectively specify improper priors on unevaluated branches.
In order to make lazy evaluation of if expressions more well-behaved,
we could choose to define the support of the joint as a union over
supports for individual branches
We can now simply define p(\mu_0) = pnorm(\mu_0; -1, 1) and p(\mu_1) = pnorm(\mu_1; 1, 1),
and assume the same likelihood as in Equation (3.2). This defines
a joint density that corresponds to what we would normally assume for
a mixture model. In this evaluation model, sample expressions in both
branches are always incorporated into the joint.
Unfortunately, eager evaluation would lead to counter-intuitive re-
sults when observe expressions occur in branches. To see this, let us
consider the following form for our program
(let [z (sample (bernoulli 0.5))
      mu0 (sample (normal -1.0 1.0))
      mu1 (sample (normal 1.0 1.0))
      y 0.5]
  (if (= z 0)
    (observe (normal mu0 1) y)
    (observe (normal mu1 1) y))
  z)
Program 3.6: One-point mixture with observes inside if.
Support-Related Subtleties
Translation rules
Now that we have developed some intuition for how one might translate
a program to a data structure that represents a graphical model, we can
define this translation precisely using inference rules. Each rule consists
of premises, written above a horizontal line, and a conclusion below it:
it states that if the statement on top holds, so does the statement on
the bottom.
For instance, the rule
ρ, φ, e ⇓ G, E
ρ, φ, (− e) ⇓ G, (− E) (3.5)
  ρ, φ, c ⇓ G_emp, c        ρ, φ, z ⇓ G_emp, z

where G_emp is the tuple (∅, ∅, [], []) that represents the empty graphical
model.
(G1 ⊕ G2 ) = (V1 ∪ V2 , A1 ∪ A2 , P1 ⊕ P2 , Y1 ⊕ Y2 )
  ρ, φ, e1 ⇓ G1, E1    ρ, (and φ E1), e2 ⇓ G2, E2
  ρ, (and φ (not E1)), e3 ⇓ G3, E3
  ─────────────────────────────────────────────────
  ρ, φ, (if e1 e2 e3) ⇓ (G1 ⊕ G2 ⊕ G3), (if E1 E2 E3)
score((if E1 E2 E3), v) = (if E1 F2 F3)
    (when Fi = score(Ei, v) for i ∈ {2, 3} and neither is ⊥)
score((c E1 ... En), v) = (pc v E1 ... En)
    (when c is a constructor for a distribution and pc its pdf or pmf)
score(E, v) = ⊥
    (when E is not one of the above cases)
  ρ, φ, e1 ⇓ G1, E1    ρ, φ, e2 ⇓ G2, E2
  (V, A, P, Y) = G1 ⊕ G2    choose a fresh variable v
  F1 = score(E1, v) ≠ ⊥    F = (if φ F1 1)
  Z = free-vars(F1) \ {v}    free-vars(E2) ∩ V = ∅
  B = {(z, v) : z ∈ Z}
  ──────────────────────────────────────────────────────────────────
  ρ, φ, (observe e1 e2) ⇓ (V ∪ {v}, A ∪ B, P ⊕ [v ↦ F], Y ⊕ [v ↦ E2]), E2
Procedure Call The remaining two cases are those for procedure
calls, one for a user-defined procedure f and one for a primitive function
c. In both cases, we first translate arguments. In the case of primitive
functions we then translate the expression for the call by substituting
translated arguments into the original expression, and merging the
graphs for the arguments
  ρ, φ, ei ⇓ Gi, Ei  for all 1 ≤ i ≤ n
  ──────────────────────────────────────────────────
  ρ, φ, (c e1 ... en) ⇓ G1 ⊕ ... ⊕ Gn, (c E1 ... En)
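These rules translate directly into a recursive procedure. The following Python sketch (our own, reusing the Graph structure sketched earlier; only two of the cases are shown) illustrates the general shape of such a translator:

def graph_merge(g1, g2):
    # the ⊕ operation: union vertices and edges, merge the maps
    return Graph(g1.V | g2.V, g1.A | g2.A,
                 {**g1.P, **g2.P}, {**g1.Y, **g2.Y})

def translate(rho, phi, e):
    if not isinstance(e, tuple):          # constants and variables
        return Graph(), e
    op, *args = e                         # primitive call (c e1 ... en)
    g, exprs = Graph(), []
    for a in args:
        g_i, e_i = translate(rho, phi, a)
        g = graph_merge(g, g_i)           # merge the argument graphs
        exprs.append(e_i)
    return g, (op, *exprs)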
We can decompose the joint probability p(V ) into a prior and a likelihood
term. In our specification of the translation rule for observe, we require
that the expression Y(v) for the observed value may not have free
variables. Each expression Y(v) will hence simplify to a constant when we
perform partial evaluation, a subject we cover extensively in Section 3.2.2
of this chapter. We will use Y to refer to all the nodes in V that
correspond to observed random variables, which is to say Y = dom(Y).
Similarly, we can use X to refer to all nodes in V that correspond
to unobserved random variables, which is to say X = V \ Y . Since
observed nodes y ∈ Y cannot have any children we can re-express the
joint probability in Equation (3.7) as
where

p(Y | X) = \prod_{y \in Y} p(y | pa(y)),    p(X) = \prod_{x \in X} p(x | pa(x)).    (3.11)
score(c, v) = (pc v)
(when c is a distribution and pc is its pdf or pmf)
To see how partial evaluation also reduces the number of edges in the
Bayesian network, let us consider the expression (if true v1 v2). This
expression nominally references two random variables v1 and v2. After
partial evaluation, this expression simplifies to v1, which eliminates the
spurious dependence on v2.
Another practical advantage of partial evaluation is that it gives us
a simple way to identify expressions in a program that are fully deter-
ministic (since such expressions will be partially evaluated to constants).
This is useful when translating observe statements (observe e1 e2 ) , in
which the expression e2 must be deterministic. In programs that use
the (loop c v e e1 . . . en ) syntactic sugar, we can now substitute any
fully deterministic expression for the number of loop iterations c. For
example, we could define a loop in which the number of iterations is
given by the dataset size.
Lists, vectors and hash maps. Eliminating spurious edges in the
dependency graph becomes particularly important in programs that
make use of data structures. Let us consider the following example,
which defines a three-step Markov chain
(let [A [[0.9 0.1]
         [0.1 0.9]]
      x1 (sample (discrete [1. 1.]))
      x2 (sample (discrete (get A x1)))
      x3 (sample (discrete (get A x2)))]
  [x1 x2 x3])
However, for the third sample statement there will be arcs from both
v1 and v2 to v3 , since Ak expands to
The extra arc from v1 to v3 is of course not necessary here, since the
expression (last (append [v1 ] v2 )) will always evaluate to v2 . What’s
more, if we run this program to generate more than 3 states, the node
vn for the n-th state will have incoming arcs from all preceding variables
v1 , . . . , vn−1 , whereas the only real arc in the Bayesian network is the
one from vn−1 .
We can eliminate these spurious arcs by implementing an additional
set of partial evaluation rules for data structures,

eval((vector E1 ... En)) = [E1 ... En],
eval((hash-map c1 E1 ... cn En)) = {c1 E1, ..., cn En}.
These rules ensure that expressions which construct data structures are
partially evaluated to data structures containing expressions. We can
similarly partially evaluate functions that add or replace entries, such
as the conj primitive, which appends an element to a data structure, as
well as accessor functions:

eval((last [E1 ... En])) = En,
eval((get [E1 ... En] k)) = Ek,
eval((get {c1 E1 ... cn En} ck)) = Ek.

In the Markov chain example, the expression for Ak in the third sample
statement then simplifies to an expression that references only v2.
This yields the correct dependency structure for the Bayesian network.
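A minimal partial evaluator for these accessor rules might look as follows in Python (our sketch, using the tuple representation from earlier; only vectors are handled):

def partial_eval(e):
    if not isinstance(e, tuple) or not e:
        return e
    e = tuple(partial_eval(x) for x in e)          # evaluate sub-expressions first
    if e[0] == 'get' and isinstance(e[1], tuple) \
            and e[1][0] == 'vector' and isinstance(e[2], int):
        return e[1][1 + e[2]]                      # select the k-th entry
    if e[0] == 'last' and isinstance(e[1], tuple) and e[1][0] == 'vector':
        return e[1][-1]
    return e

# The spurious dependence on x1 disappears:
print(partial_eval(('get', ('vector', 'x1', 'x2'), 1)))   # -> 'x2'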
x ∈ free-vars(P(v)).    (3.24)

Q := [x_1 ↦ E_1, ..., x_N ↦ E_N].    (3.27)
where two latent variables are highly correlated: when updating one
conditioned on a fixed value of the other, it is only possible to make very
small changes at a time. In contrast, a block proposal which updates
both these random variables at once can move them larger distances,
in sync. As a pathological example, consider the FOPPL program
(let [x0 (sample (normal 0 1))
      x1 (sample (normal 0 1))]
  (observe (normal (+ x0 x1) 0.01) 2.0))
Note here that if we were able to perfectly integrate the equations
of motion in (3.34), then the sample would be accepted with probability 1.
In other words, the acceptance ratio is purely due to numerical
errors that arise from the discretization of the equations of motion.
The first 3 base cases are trivial. This means that we can compute
the gradient of any target language expression E with respect to the
values of its free variables, as long as we are able to calculate the partial
derivatives of values returned by primitive procedure applications with
respect to the values of the inputs.
Let us discuss this last case in more detail. Suppose that f is a
primitive that accepts n real-valued inputs, and returns a real-valued
output c = f (c1 , . . . , cn ). In order to perform reverse-mode AD, we will
replace this primitive with a "lifted" variant f̃ such that c̃ = f̃(c̃_1, ..., c̃_n)
will return a boxed value, which contains the return value c of f, the
input values c̃_i, and the values of the partial derivatives
ċ_i = ∂f(v_1, ..., v_n)/∂v_i |_{v_i = c_i} of the output with respect to
the inputs. Algorithm 2 shows pseudo-code
for an operation that constructs an AD-compatible primitive f˜ from
a primitive f and a second primitive ∇f that computes the partial
derivatives of f with respect to its inputs.
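A minimal version of this lifting operation can be sketched in Python as follows (our own construction, not Algorithm 2 itself; real systems are considerably more careful about boxing):

def lift(f, grad_f):
    # build an AD-compatible variant of f; grad_f returns the list of
    # partial derivatives of f evaluated at the same inputs
    def f_tilde(*boxed):
        # unbox: a boxed value is a (value, inputs, partials) triple
        vals = [b[0] if isinstance(b, tuple) else b for b in boxed]
        return (f(*vals), boxed, grad_f(*vals))
    return f_tilde

mul = lift(lambda a, b: a * b,
           lambda a, b: [b, a])    # d(ab)/da = b, d(ab)/db = a

c_tilde = mul(2.0, 3.0)            # (6.0, (2.0, 3.0), [3.0, 2.0])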
The boxed value c̃ is a recursive data structure that we can use to
walk the computation graph. Each of the input values c̃_i corresponds to
the value of a sub-expression that is either a constant, a variable, or the
return value of another primitive procedure application. The first two
cases correspond to leaf nodes in the computation graph. In the case of a
variable v (Equation 3.42), we are at an input where the gradient is 1
for the component associated with v, and 0 for all other components.
We represent this sparse vector as a map G = [v ↦ 1]. When we reach a
constant value (Equation 3.41), we don't need to do anything, since
the gradient of a constant is 0. We represent this zero gradient as an
empty map G = [ ]. In the third case (Equation 3.44), we can recursively
unpack the boxed values c̃_i to compute gradients with respect to input
values, multiply the resulting gradient terms by the partial derivatives ċ_i,
and finally sum over i.
Algorithm 3 shows pseudo-code for an algorithm that performs the
reverse-mode gradient computation according to this recursive strategy.
In this algorithm, we need to know the variable v that is associated
with each of the inputs. In order to ensure that we can track these
correspondences we will box an input value c associated with variable v
into a pair c̃ = (c, v). The gradient computation in Algorithm 3 now
pattern matches against values c̃ to determine whether the value is
an input, an intermediate value that was returned from a primitive
procedure call, or any other value (which has a 0 gradient).
Given this implementation of reverse-mode AD, we can now compute
the gradient of the potential function in two steps
for the momentum R at time points that are shifted by ε/2 relative to
those at which we compute the position. There is a trade-off between
the choice of step size ε and the numerical stability of the integration
scheme, which affects the acceptance rate. Moreover, this step size
should also appropriately account for the choice of mass matrix M,
which is generally chosen to match the covariance in the posterior
expectation M_{ij}^{-1} ≃ E_{π(X)}[x_i x_j] − E_{π(X)}[x_i] E_{π(X)}[x_j]. Finally, modern
implementations of HMC typically employ a No-U-Turn sampling (NUTS)
scheme to ensure that the number of time steps T is chosen in a way
that minimizes the degree of correlation between samples.
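For reference, a bare-bones leapfrog integrator looks as follows (our sketch, not the book's implementation; a unit mass matrix is assumed, and grad_U returns the gradient of the potential U(X) = -log p(Y, X)):

def leapfrog(X, R, grad_U, eps, T):
    R = R - 0.5 * eps * grad_U(X)       # initial half step for the momentum
    for _ in range(T - 1):
        X = X + eps * R                 # full step for the position
        R = R - eps * grad_U(X)         # full step for the momentum
    X = X + eps * R
    R = R - 0.5 * eps * grad_U(X)       # final half step for the momentum
    return X, R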
An implementation consideration unique to probabilistic program-
ming is that not all FOPPL programs define densities γ(X) = p(Y, X)
that are differentiable at all points in the space. The same is true
for systems like Stan (Stan Development Team, 2014) and PyMC3
(Salvatier et al., 2016), which opt to provide users with a relatively
expressive modeling language that includes if expressions, loops, and
even recursion. While these systems enforce the requirement that a
program defines a density over set of continuous variables that is known
at compile time, they do not enforce the requirement that the density
is differentiable. For example, the following FOPPL program would be
perfectly valid when expressed as a Stan or PyMC3 model
(let [x (sample (normal 0.0 1.0))
      y 0.5]
  (if (> x 0.0)
    (observe (normal 1.0 0.1) y)
    (observe (normal -1.0 0.1) y)))
  ρ, c ⇓f c        ρ, v ⇓f v

  ρ, e1 ⇓f e′1    ρ, e2 ⇓f e′2
  ──────────────────────────────────────
  ρ, (let [v e1] e2) ⇓f (let [v e′1] e′2)

  ρ, e ⇓f e′
  ────────────────────────────
  ρ, (sample e) ⇓f (sample e′)

  ρ, e1 ⇓f e′1    ρ, e2 ⇓f e′2
  ──────────────────────────────────────
  ρ, (observe e1 e2) ⇓f (observe e′1 e′2)

  ρ, ei ⇓f e′i for i = 1, ..., n    op = if or op = c
  ──────────────────────────────────────────────────
  ρ, (op e1 ... en) ⇓f (sample (dirac (op e′1 ... e′n)))

Figure 3.2: Inference rules for the transformation ρ, e ⇓f e′, which replaces if forms
and primitive procedure calls with expressions of the form (sample (dirac e)).
where the factors ψ_v(X_v) are equivalent to the expressions P(v) that
evaluate the probability density for each variable v, which can be either
observed or unobserved,

ψ_v(X_v) ≡ P(v)[v := Y(v)]  when v ∈ dom(Y),
ψ_v(X_v) ≡ P(v)             when v ∉ dom(Y).    (3.51)
A graph in which we replace the factors f and g with the merged factor
h is then an equivalent representation of the density. The implication
of this is that there is a choice in the level of granularity at which
we wish to represent a factor graph. The representation above has a
comparatively low level of granularity; we will here consider a more
fine-grained representation.

(sample (dirac e′))

p_dirac(x; c) = 𝕀[x = c]
After this source code transformation, we can use the rules from Sec-
tion 3.1 to compile the transformed program into a directed graphical
model (V, A, P, Y). This model will be equivalent to the directed graph-
ical model of the untransformed program, but contains an additional
node for each Dirac-distributed deterministic variable.
The inference rules for the source code transformation ρ, e ⇓f e0 are
much simpler than the rules that we wrote down in Section 3.1. We
show these rules in Figure 3.2. The first two rules state that constants
c and variables v are unaffected. The next rules state that let, sample,
and observe forms are transformed by transforming each of the sub-
expressions, inserting deterministic variables where needed. User-defined
procedure calls are similarly transformed by transforming each of the
arguments e1 , . . . , en , and transforming the procedure body e0 . So
far, none of these rules have done anything other than state that we
transform an expression by transforming each of its sub-expressions.
The two cases where we insert Dirac-distributed variables are if forms
and primitive procedure applications. For these expression forms e, we
transform the sub-expressions to obtain a transformed expression e0
and then return the wrapped expression (sample (dirac e0 )).
As noted above, a directed graphical model can always be interpreted
as a factor graph that contains a single factor for each random variable. To
aid discussion in the next section, we will explicitly define the mapping
from the directed graph (V dg , Adg , P dg , Y dg ) of a transformed program
X := X^dg = V^dg \ dom(Y^dg).    (3.53)
The map Ψ contains an expression Ψ(f) for each factor, which evaluates
the potential function of the factor ψ_f(X_f). We define Ψ(f) in terms of
the corresponding expression for the conditional density P^dg(v_f),

Ψ(f) := P^dg(v_f)[v_f := Y^dg(v_f)]  when v_f ∈ dom(Y^dg),
Ψ(f) := P^dg(v_f)                    when v_f ∉ dom(Y^dg).    (3.55)
This defines the usual correspondence between ψf (Xf ) and Ψ(f ), where
we note that the set of variables Xf associated with each factor f is
equal to the set of variables in Ψ(f ),
This program differs from the ones we have considered so far in that
we are using a Dirac delta to enforce a hard constraint on observations,
which means that this program defines an unnormalized density
γ(s1, s2) = pnorm(s1; 0, 1) pnorm(s2; 0, 1) 𝕀[(s1 − s2) > 0.1].    (3.57)
This type of hard constraint poses problems for many inference al-
gorithms for directed graphical models. For example, in HMC this
introduces a discontinuity in the density function. However, as we will
see in the next section, inference methods based on message passing are
much better equipped to deal with this form of conditioning.
When we compile the program above to a factor graph, we obtain the
potentials
Ψ = [ fs1 ↦ (pnorm s1 0.0 1.0),
      fs2 ↦ (pnorm s2 0.0 1.0),
      fδ ↦ (pdirac δ (- s1 s2)),
      fw ↦ (pdirac w (> δ 0.1)),
      fy ↦ (pdirac true w) ].    (3.58)
Note here that the variables s1 and s2 would also be present in the
directed graph corresponding to the untransformed program. The
deterministic variables δ and w have been added as a result of the
transformation in Figure 3.2. Since the factor fy restricts w to the value
true, we can eliminate fy from the set of factors and w from the set of
variables. This results in a simplified graph where X = (s1, s2, δ) and
the potentials

fs1 ↦ (pnorm s1 0.0 1.0), ...
If we assume that the parameters λ*_x satisfy the condition above, then
we can use Equation 3.68 to define the update for the message φ_{f→x}

λ_{f→x} ← λ*_x − \sum_{f′ ≠ f : x ∈ X_{f′}} λ_{f′→x}.    (3.74)
= \int dX_f \, ψ_f(X_f) \prod_{x ∈ X_f} \frac{φ_{x→f}(x)}{Z_x^q}.    (3.75)
Here, the function φ_{x→f}(x) is known as the message from the variable
x to the factor f, which is defined as

φ_{x→f}(x) := \prod_{f′ ≠ f : x ∈ X_{f′}} φ_{f′→x}(x).    (3.76)
These messages can also be used to compute the second set of integrals
for the sufficient statistics
t̄ = E_{π_f(V)}[t(v)] = \frac{1}{Z_f} \int dV_f \, t(x) \, ψ_f(X_f) \prod_{x ∈ X_f} \frac{φ_{x→f}(x)}{Z_x^q}.    (3.77)
for the sufficient statistics in Equation (3.77) for each potential type.
This requirement imposes certain constraints on the programs we can
write. The cases that we have to consider are stochastic factors (sample
and observe expressions) and deterministic factors (if expressions and
primitive procedure calls).
For sample and observe expressions, potentials have the form Ψ(f ) =
(p v0 v1 . . . vn ) and Ψ(f ) = (p c0 v1 . . . vn ) respectively. For these
potentials, we have to integrate over products of densities, which can in
general be done only for a limited number of cases, such as conjugate
prior and likelihood pairs. This means that the exponential family that
is chosen for the messages needs to be compatible with the densities in
sample and observe expressions.
Deterministic factors take the form (pdirac v0 E) where E is an
expression in which all sub-expressions are variable references,
E ::= (if v1 v2 v3) | (c v1 ... vn)
4 Evaluation-Based Inference I
E_{q(X)}\left[ \frac{p(X|Y)}{q(X)} r(X) \right] ≃ \frac{1}{L} \sum_{l=1}^{L} w^l \, r(X^l).

W^l := \frac{p(Y, X^l)}{q(X^l)} = p(Y) \, w^l.    (4.1)
If we substitute p(X|Y) = p(Y, X)/p(Y) then we can re-express the
expectation over q(X) in terms of the unnormalized weights,

E_{q(X)}\left[ \frac{p(X|Y)}{q(X)} r(X) \right] = \frac{1}{p(Y)} E_{q(X)}\left[ \frac{p(Y, X)}{q(X)} r(X) \right],    (4.2)

≃ \frac{1}{p(Y)} \frac{1}{L} \sum_{l=1}^{L} W^l \, r(X^l),    (4.3)
This solves one problem, since the unnormalized weights W l are quan-
tities that we can calculate directly, unlike the normalized weights wl .
However, we now have a new problem: We also don’t know how to
calculate the normalization constant p(Y ). Thankfully, we can derive
an approximation to p(Y ) using the same unnormalized weights W l by
• For program variables v, the evaluator returns the value `(v) that
is stored in the local environment.
• For sample forms (sample e), we first evaluate the distribution ar-
gument e to obtain a distribution value d. We then call sample(d)
to generate a sample from this distribution. Here sample is a
function in the language that implements the evaluator, which
needs to be able to generate samples of each distribution type in
the FOPPL (in other words, we can think of sample as a required
method for each type of distribution object).
• For observe forms (observe e1 e2), we first evaluate the argument
e1 to a distribution d1 and the argument e2 to a value c2. We
then update a variable σ(log W), which is stored in the inference
state, by adding log-prob(d1, c2), the log likelihood of
c2 under the distribution d1. Finally we return c2. The function
log-prob similarly needs to be able to compute log probability
densities for each distribution type in the FOPPL. A sketch of
both cases follows below.
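A stripped-down Python rendering of these two cases (our sketch; distribution objects are assumed to expose sample() and log_prob(), mirroring the required methods just described):

import math, random

class Normal:
    def __init__(self, mu, sigma):
        self.mu, self.sigma = mu, sigma
    def sample(self):
        return random.gauss(self.mu, self.sigma)
    def log_prob(self, c):
        return -0.5 * ((c - self.mu) / self.sigma) ** 2 \
               - math.log(self.sigma * math.sqrt(2 * math.pi))

def eval_sample(d, state):
    return d.sample()                       # draw from the prior

def eval_observe(d, c, state):
    state['log_W'] += d.log_prob(c)         # accumulate the log likelihood
    return c

state = {'log_W': 0.0}
x = eval_sample(Normal(0.0, 1.0), state)    # (sample (normal 0.0 1.0))
eval_observe(Normal(x, 1.0), 0.5, state)    # (observe (normal x 1.0) 0.5)
print(state['log_W'])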
In other words, inference rules not only formally specify how our
translation should behave, but also give us a recipe for implementing
a recursive translate operation for each expression type.
This similarity is not an accident. In fact, inference rules are com-
monly used to specify the big-step semantics of a programming language,
which defines the value of each expression in terms of the values of
its sub-expressions. We can similarly use inference rules to define our
evaluation-based likelihood weighting method. We show these inference
rules in Figure 4.1.
  ρ, `, c ⇓ c, 0

  `(v) = c
  ──────────────
  ρ, `, v ⇓ c, 0

  ρ, `, e1 ⇓ c1, l1    ρ, ` ⊕ [v1 ↦ c1], e0 ⇓ c0, l0
  ──────────────────────────────────────────────────
  ρ, `, (let [v1 e1] e0) ⇓ c0, l0 + l1

  ρ, `, e1 ⇓ true, l1    ρ, `, e2 ⇓ c2, l2
  ────────────────────────────────────────
  ρ, `, (if e1 e2 e3) ⇓ c2, l1 + l2

  ρ, `, e1 ⇓ false, l1    ρ, `, e3 ⇓ c3, l3
  ────────────────────────────────────────
  ρ, `, (if e1 e2 e3) ⇓ c3, l1 + l3

  ρ, `, ei ⇓ ci, li for i = 1, ..., n    c(c1, ..., cn) = c0
  ──────────────────────────────────────────────────────────
  ρ, `, (c e1 ... en) ⇓ c0, l1 + ... + ln

Figure 4.1: Big-step semantics for likelihood weighting. These rules define an
evaluation operation ρ, `, e ⇓ c, l, in which ρ refers to the global environment,
` to the local environment, e is an expression, c is the value of the expression,
and l is its log likelihood.
the expression for its observed value e2 , then the program would still
produce the same distribution on return values when sampling from the
prior, but the log weight σ(log W ) would be 0 after every execution.
The distinction between referentially transparent and opaque expres-
sions also implicitly showed up in our compilation procedure in Section
3.1. Here we translated an opaque program into a set of target-language
expressions for conditional probabilities, which were referentially trans-
parent. In these target-language expressions, each sub-expression cor-
responding to sample or observe was replaced with a free variable v.
If a translated expression has no free variables, then the original un-
translated expression is referentially transparent. In Section 3.2.2, we
implicitly exploited this property to replace all target-language expres-
sions without free variables with their values. We also relied on this
property in Section 3.1 to ensure that observe forms (observe e1 e2 )
always contained a referentially transparent expression for the observed
value e2 .
4.2 Metropolis-Hastings
following pseudo-algorithm.

- Initialize the current sample X. Return X^1 ← X.
- For each subsequent sample s = 2, ..., S:
    - Generate a proposal X′ ∼ q(X′ | X).
    - Calculate the acceptance ratio

      α = \frac{p(Y′, X′) \, q(X | X′)}{p(Y, X) \, q(X′ | X)}    (4.8)

    - With probability min(1, α), update the current sample X ← X′;
      otherwise keep X. Return X^s ← X.
An evaluation-based implementation of an MH sampler needs to do two
things. The first is to run the program to generate a proposal,
conditioned on the values X of sample expressions that were evaluated
previously. The second is to compute the acceptance ratio α as a side
effect.
Let us begin by considering a simplified version of this algorithm.
Suppose that we defined q(X 0 |X) = p(X 0 ). In other words, at each
step we generate a sample X 0 ∼ p(X) from the program prior, which is
independent of the previous sample X. We already know that we can
generate these samples simply by running the program. The acceptance
ratio now simplifies to:

α = \frac{p(Y′, X′) \, q(X | X′)}{p(Y, X) \, q(X′ | X)} = \frac{p(Y′ | X′) \, p(X′) \, p(X)}{p(Y | X) \, p(X) \, p(X′)} = \frac{p(Y′ | X′)}{p(Y | X)}    (4.9)
In other words, when we propose from the prior, the acceptance ratio
is simply the ratio of the likelihoods. Since our likelihood weighting
algorithm computes σ(log W ) = log p(Y | X) as a side effect, we can re-
use the evaluator from Algorithm 7 and simply evaluate the acceptance
ratio as W 0 /W , where W 0 = p(Y 0 |X 0 ) is the likelihood of the proposal
and W = p(Y |X) is the likelihood associated with the previous sample.
Pseudo-code for this implementation is shown in Algorithm 8.
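The essence of this sampler fits in a few lines of Python (our sketch, not Algorithm 8 itself). Here run() is assumed to execute the program once, returning a sample X from the prior together with log p(Y | X):

import math, random

def mh_from_prior(run, S):
    X, log_W = run()
    samples = [X]
    for _ in range(S - 1):
        X_new, log_W_new = run()                   # propose from the prior
        # accept with probability min(1, W'/W), computed in log space
        if math.log(random.random()) < log_W_new - log_W:
            X, log_W = X_new, log_W_new
        samples.append(X)
    return samples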
Since all variables in X′^{sampled} were sampled from the program prior,
the proposal probability is
Since some of the terms in the prior and the proposal cancel, the ratio
p(Y 0 , X 0 )/q(X 0 |X, x0 ) simplifies to
\frac{p(Y′, X′)}{q(X′ | X, x_0)} = \prod_{y ∈ Y′} P′(y) \prod_{x ∈ X′^{reused}} P′(x)    (4.18)
We can define the ratio p(Y, X)/q(X | X′, x_0) for the reverse transition
by noting that this transition would require sampling a set of variables
X^{sampled} from the prior whilst reusing a set of variables X^{reused}:

\frac{p(Y, X)}{q(X | X′, x_0)} = \prod_{y ∈ Y} P(y) \prod_{x ∈ X^{reused}} P(x).    (4.19)
Here the set of reused variables X^{reused} for the reverse transition is, by
definition, identical to that of the forward transition X′^{reused},
  ρ, c ⇓α c        ρ, v ⇓α v

  ρ, e1 ⇓α e′1    ρ, e0 ⇓α e′0
  ─────────────────────────────────────────
  ρ, (let [v1 e1] e0) ⇓α (let [v1 e′1] e′0)

  ρ, ei ⇓α e′i for i = 1, ..., n    op = if or op = c
  ───────────────────────────────────────────────────
  ρ, (op e1 ... en) ⇓α (op e′1 ... e′n)
Addressing Transformation
Evaluating Proposals
σ(x_0), which is the site of the proposal, a map σ(X) of sampled values,
a map σ(log P), which holds the log density for each variable, and
finally a "cache" σ(C) of values that we would like to condition the
execution on.
For a sample expression with address v, we reuse the value X (v) ←
C(v) when possible, unless we are evaluating the proposal site v = x0 .
In all other cases, we sample X (v) from the prior. For both sample and
observe expressions we calculate the log probability log P(v).
The Metropolis-Hastings implementation is shown in Algorithm 11.
This algorithm initializes the first sample by evaluating the program,
storing the values σ(X) and log probabilities σ(log P) for the current
sample. For each subsequent sample the algorithm then selects the
initial site x_0 at random from the domain of the current sample σ(X).
We then rerun the program accordingly to construct a proposal, and
either accept or reject according to the ratio defined in Algorithm 9.
4.3 Sequential Monte Carlo
X_1^l ∼ q_1(X_1),    W_1^l := \frac{γ_1(X_1^l)}{q_1(X_1^l)}.    (4.23)
W_n^l := W_{n∖n−1}^l \, Ẑ_{n−1}    (4.26)

where W_{n∖n−1}^l is the incremental weight

W_{n∖n−1}^l := \frac{γ_n(X_n^l)}{γ_{n−1}(X_{n−1}^{a_{n−1}^l}) \, q_n(X_n^l | X_{n−1}^{a_{n−1}^l})},    (4.27)
X1 ⊆ X2 ⊆ . . . ⊆ XN = X, (4.29)
Y1 ⊆ Y2 ⊆ . . . ⊆ YN = Y. (4.30)
In this section we will assume that we first apply the addressing trans-
formation from Section 4.2.1 to a FOPPL program. We then assume
that the user identifies a sequence of symbols y1 , . . . , yN −1 for observe
expressions that satisfy the two properties above. An alternative design,
which is often used in practice, is to simply break at every observe and
assert that each sample has halted at the same point at run time.
γ_n(X_n) = p(Y_n, X_n)
         = p(Y_n | Y_{n−1}, X_n) p(X_n | X_{n−1}) p(Y_{n−1}, X_{n−1})
         = p(Y_n | Y_{n−1}, X_n) p(X_n | X_{n−1}) γ_{n−1}(X_{n−1}).
W_{n∖n−1}^l = \frac{p(Y_n^l | X_n^l)}{p(Y_{n−1}^{a_{n−1}^l} | X_{n−1}^{a_{n−1}^l})} = \prod_{y ∈ Y_{n∖n−1}^l} p(y | X_n^l),    (4.31)

where Y_{n∖n−1}^l is the set difference between the observed variables at
generation n and the observed variables at generation n − 1,

Y_{n∖n−1}^l = dom(Y_n^l) \ dom(Y_{n−1}^{a_{n−1}^l}).
13: for l in 1, ..., L do
14:     a_{n−1}^l ∼ discrete(W_{n−1}^{1:L} / Σ_l W_{n−1}^l)
15:     if n < N then
16:         (X_n^l, log Λ_n^l) ← propose(X_{n−1}^{a_{n−1}^l}, y_n)
17:     else
18:         (r^l, log Λ_N^l) ← propose(X_{N−1}^{a_{N−1}^l}, nil)
19:     log W_n^l ← log Λ_n^l − log Λ_{n−1}^{a_{n−1}^l} + log Ẑ_{n−1}
20: return ((r^1, log W_N^1), ..., (r^L, log W_N^L))
log-mean-exp(log W_{n−1}^{1:L}) = log\left( \frac{1}{L} \sum_{l=1}^{L} W_{n−1}^l \right).    (4.33)
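In practice log-mean-exp is computed by subtracting the maximum log weight before exponentiating, which avoids overflow and underflow. A small sketch of this (ours, not the book's code):

import math

def log_mean_exp(log_ws):
    m = max(log_ws)
    return m + math.log(sum(math.exp(lw - m) for lw in log_ws) / len(log_ws))

print(log_mean_exp([-1000.0, -1001.0]))   # finite, even though exp(-1000) underflows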
4.4 Black Box Variational Inference
L(λ) := E_{q(X;λ)}\left[ log \frac{p(Y, X)}{q(X; λ)} \right]
      = log p(Y) − D_{KL}(q(X; λ) || p(X|Y)) ≤ log p(Y).    (4.34)

D_{KL}(q(X; λ) || p(X|Y)) := E_{q(X;λ)}\left[ log \frac{q(X; λ)}{p(X|Y)} \right].    (4.35)
E_{q(X;λ)}\left[ ∇_λ log \frac{p(Y, X)}{q(X; λ)} \right] = −E_{q(X;λ)}[∇_λ log q(X; λ)]
    = −\int dX \, q(X; λ) ∇_λ log q(X; λ)    (4.39)
    = −\int dX \, ∇_λ q(X; λ) = −∇_λ 1 = 0,

where the final equalities make use of the fact that, by definition,
\int dX \, q(X; λ) = 1, since a probability distribution is normalized.
If we additionally substitute ∇_λ q(X; λ) := q(X; λ) ∇_λ log q(X; λ) in
Equation (4.38), then we can express the gradient of the ELBO as

∇_λ L(λ) = E_{q(X;λ)}\left[ ∇_λ log q(X; λ) \left( log \frac{p(Y, X)}{q(X; λ)} − b \right) \right],    (4.40)

where b is an arbitrary constant vector, which does not change the expected
value since E_{q(X;λ)}[∇_λ log q(X; λ)] = 0.
The likelihood-ratio estimator for the gradient of the ELBO ap-
proximates the expectation with a set of samples X^l ∼ q(X; λ). If we
define the standard importance weight W^l = p(Y, X^l)/q(X^l; λ), then
the likelihood-ratio estimator is defined as

\hat{∇}_λ L(λ) := \frac{1}{L} \sum_{l=1}^{L} ∇_λ log q(X^l; λ) \left( log W^l − \hat{b} \right).    (4.41)
vector λ_v, then the variance reduction constant \hat{b}_{v,d} for the component
λ_{v,d} is defined as

\hat{b}_{v,d} := \frac{covar(F_{v,d}^{1:L}, G_{v,d}^{1:L})}{var(G_{v,d}^{1:L})},    (4.42)

F_{v,d}^l := ∇_{λ_{v,d}} log q(X_v^l; λ_v) \, log W^l,    (4.43)
G_{v,d}^l := ∇_{λ_{v,d}} log q(X_v^l; λ_v).    (4.44)
From the equations above, we see that we need to calculate two sets
of quantities in order to estimate the gradient of the ELBO. The
first consists of the importance weights log W^l. The second consists
of the gradients G_v^l of the log proposal density for each variable.
can easily be of order 10^3 to 10^4. This means that BBVI may need
on the order of 10^6 or more samples before it starts generating
high-quality proposals.
When we compile a program to a graph (V, A, P, Y) we can perform
an additional optimization to reduce the variance. To do so, we replace
the term log W in the objective with a vector in which each component
log Wv contains a weight that is restricted to the variables in the Markov
blanket,
log W_v = \sum_{w ∈ mb(v)} log \frac{p(w | pa(w))}{q(w | λ_w)},    (4.45)
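To make the estimator in Equation (4.41) concrete, here is a sketch for a single Gaussian mean parameter (our addition; the baseline b̂ is omitted for brevity, and log_joint(x) is assumed to compute log p(Y, x) for the model at hand):

import math, random

def elbo_grad(log_joint, lam, sigma=1.0, L=100):
    g = 0.0
    for _ in range(L):
        x = random.gauss(lam, sigma)                   # x^l ~ q(x; lam)
        log_q = -0.5 * ((x - lam) / sigma) ** 2 \
                - math.log(sigma * math.sqrt(2 * math.pi))
        score = (x - lam) / sigma ** 2                 # gradient of log q w.r.t. lam
        g += score * (log_joint(x) - log_q)            # (log W^l) term
    return g / L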
5 A Probabilistic Programming Language With Recursion

• Functions are not first-class objects, which means that it is not
  possible to pass them as arguments or return them as values.
• The first argument to the loop form, the loop depth, has to be
  a constant, since loop is syntactic sugar that is unrolled into nested
  let expressions at compile time.
Say that we wish to remove this last restriction, and would like to
be able to loop over the range determined by the runtime value of a
program variable.
This means that the looping construct cannot be syntactic sugar, but
must instead be a function that takes the loop bound as an argument
and repeats the execution of the loop body up to this dynamically-
determined bound.
If we wanted to implement a loop function that supports a dynamic
number of loop iterations, then we could do so as follows
(defn loop-helper [i c v f a1 ... an]
  (if (= i c)
    v
    (let [v′ (f i v a1 ... an)]
      (loop-helper (+ i 1) c v′ f a1 ... an))))

(defn loop [c v f a1 ... an]
  (loop-helper 0 c v f a1 ... an))
This program, which represents the distribution over sums of the
outcomes of a Poisson-distributed number of fair coin flips, is one of
the shortest programs that illustrates concretely what we mean by a
(defn geometric [p]
  (let [dist (flip p)]
    (if (sample dist)
      0
      (if (sample dist)
        1
        (if (sample dist)
          2
          ...
          (if (sample dist)
            ∞
            (geometric-helper (+ ∞ 1))))))))
5.1 Syntax
v ::= variable
c ::= constant value or primitive operation
f ::= procedure
e ::= c | v | f | (if e e e) | (e e1 ... en) | (sample e)
    | (observe e e) | (fn [v1 ... vn] e)
q ::= e | (defn f [v1 ... vn] e) q
Language 5.4: Higher-order probabilistic programming language (HOPPL)
For instance,
(let [a (+ k 2)
      b (* a 6)]
  (print (+ a b))
  (* a b))
With this loop and let sugar defined, and with foreach straightforward
to implement, any valid FOPPL program is also valid in the HOPPL.
5.3 Examples
(defn read-data []
  (read-data-from-disk "filename.csv"))
The repeatedly function can stand in for the fixed-length loops that
we used to sample mixture components from the prior in the FOPPL
implementation. An example implementation is in Program 5.5.
(defn sample-likelihood []
  (let [sigma (sample (gamma 1.0 1.0))
        mean (sample (normal 0.0 sigma))]
    (normal mean sigma)))

...
  ys))
Program 5.5: HOPPL: An open-universe Gaussian mixture model with an unknown
number of components
Here we still used a fixed, small data set (the ys values, same as
before, are inlined) but the model code would not change if this were
replaced by a larger data set. Models such as this one, where the
distribution over the number of mixture components K is unbounded
above, are sometimes known as open-universe models: given a small
amount of data, we may infer there are only a small number of clusters;
however, if we were to add more and more entries to ys and re-run
inference, we do not discount the possibility that there are additional
clusters (i.e. a larger value of K) than we had previously considered.
Notice that the way we wrote this model interleaves sampling from z
with observing values of y, rather than sampling all values z1 , z2 , z3 , . . .
up front. While this does not change the definition of the model (i.e. does
not change the joint distribution over observed and latent variables),
writing the model in a formulation which moves observe statements as
early as possible (or alternatively delays calls to sample) yields more
efficient SMC inference.
Program synthesis
We can sample from the space of all functions f (x) generated by com-
position of digits with +, -, *, and /, by starting from the initial rule
for expanding f and recursively applying rules to fill in values of op,
num, and e until only terminals remain. To do so, we need to assign a
probability for sampling each rule at each stage of the expansion. In the
following example, when expanding each e we choose a number with
probability 0.4, the symbol x with probability 0.3, and a new function
application with probability 0.3; both operations and numbers 0, . . . , 9
are chosen uniformly.
(defn gen-operation []
  (sample (uniform [+ - * /])))
(defn gen-expr []
  (let [expr-prior (discrete [0.4 0.3 0.3])
        expr-type (sample expr-prior)]
    (case expr-type
      0 (sample (uniform-discrete 0 10))
      1 (quote x)
      2 (let [operation (gen-operation)]
          (list operation
                (gen-expr)
                (gen-expr))))))
(defn gen-function []
  (list (quote fn) [(quote x)]
        (list (gen-operation)
              (gen-expr)
              (gen-expr))))
Program 5.6: A generative model for functions of a single variable
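To use a generated expression as an actual function it must be
evaluated; a sketch, assuming the host environment exposes an eval
primitive (eval is not part of the HOPPL grammar given earlier):

(let [f-expr (gen-function)   ; e.g. (fn [x] (- x 8))
      f (eval f-expr)]        ; turn the expression into a function
  (f 10))                     ; apply it to a concrete input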
(fn [x] (- (/ (- (* 7 0) 2) x) x))
(fn [x] (- x 8))
(fn [x] (* 5 8))
(fn [x] (+ 7 6))
(fn [x] (* x x))
(fn [x] (* 2 (+ 0 1)))
(fn [x] (/ 6 x))
(fn [x] (- 0 (+ 0 (+ x 5))))
(fn [x] (- x 6))
(fn [x] (* 3 x))
(fn [x]
  (+ (+ 2
        (- (/ x x)
           (- x (/ (- (- 4 x) (* 5 4))
                   (* 6 x)))))
     x))
(fn [x] (- x (+ 7 (+ x 4))))
(fn [x] (+ (- (/ (+ x 3) x) x) x))
(fn [x] (- x (* (/ 8 (/ (+ x 5) x)) (- 0 1))))
(fn [x] (/ (/ x 7) 7))
(fn [x] (/ x 2))
(fn [x] (* 8 x))
Program 5.7: Unconditioned samples from a generative model for arithmetic
expressions, produced by calling (gen-function)
Most of the generated expressions are fairly short, with many contain-
ing only a single function application. This is because the choice of
probabilities in Program 5.6 is biased towards avoiding nested function
applications; the probability of producing a number or the variable x is
0.7, a much larger value than the probability 0.3 of producing a function
application. However, there is still positive probability of sampling an
expression of any arbitrarily large size — there is nothing which explic-
itly bounds the number of function applications in the model. Such a
model could not be written in the FOPPL without introducing a hard
bound on the recursion depth. In the HOPPL we can allow functions to
grow long if necessary, while still preferring short results, thanks to
the lazy evaluation of if expressions and the lack of any need to
enumerate possible random choices.
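To make the preference for short expressions precise, note that each
expansion of e spawns two fresh subexpressions with probability 0.3
(the function-application case) and none otherwise, so the expected
number of offspring per expansion is

$$2 \times 0.3 = 0.6 < 1.$$

The resulting branching process therefore terminates with probability 1,
and the expected total number of e-expansions is $1/(1 - 0.6) = 2.5$.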
Note that some caution is required when defining models which can
generate a countably infinite number of latent random variables: the
recursion must terminate with probability one for the program to define
a valid distribution.
Captcha-breaking
We can also now revisit the Captcha-breaking example we discussed
in Chapter 1, and write a generative model for Captcha images in the
HOPPL.
[Figure: four panels plotting data points together with posterior
samples of f(x); the panel titles give the sampled expressions
(fn [x] (/ (+ 0 (* 7 (- 5 x))) (+ 8 (/ (* x x) (+ 4 (- 3 x)))))),
(fn [x] (/ (+ 0 (* 7 (- 5 x))) (+ 8 (/ (* x x) (+ (/ x 8) 5))))),
(fn [x] (/ (+ 0 (* 7 (- 5 x))) (+ 8 (/ (* x x) (+ (/ (- 8 x) x) 5))))),
and (fn [x] (/ (+ 0 (* 8 (- 5 x))) 9))]
Figure 5.1: Examples of posterior sampled functions, drawn from the same MH
chain.
Here we treated the render function as a black box, and just assumed
it could be called by the HOPPL program. In fact, so long as render
has purely deterministic behavior and no side effects, it can actually be
written in another language entirely, or even be a black-box precompiled
binary; it is just necessary that it can be invoked in some manner by
the HOPPL code (e.g. through a foreign function interface, or some
sort of inter-process communication).
6
Evaluation-Based Inference II
The first transformation associates an address with the messages
that need to be sent at each sample and
observe expression. The second transformation converts the HOPPL
program to continuation passing style. Unlike the graph compiler in
Chapter 3 and the custom evaluators in Chapter 4, both these code
transformations take HOPPL programs as input and yield outputs that
are still HOPPL programs; they do not change the language.
If the HOPPL has an existing efficient compiler, we can still use that
compiler on the addressed and CPS-transformed output code. Once we
have our model code transformed into this format, we show how we
can implement a thin client-server layer and use this to define HOPPL
variants of many of the evaluation-based inference algorithms from
Chapter 4; this time, without needing to write an explicit evaluator.
6.2 Addressing Transformation

$$v, \alpha \Downarrow_{\mathrm{addr}} v \qquad\qquad f, \alpha \Downarrow_{\mathrm{addr}} f$$

$$\frac{e_1, \alpha \Downarrow_{\mathrm{addr}} e_1' \qquad e_2, \alpha \Downarrow_{\mathrm{addr}} e_2' \qquad e_3, \alpha \Downarrow_{\mathrm{addr}} e_3'}{(\texttt{if } e_1\ e_2\ e_3),\ \alpha \Downarrow_{\mathrm{addr}} (\texttt{if } e_1'\ e_2'\ e_3')}$$
This is not the only choice one could make for this rule, as making
an address more complex is completely fine so long as each random
variable remains uniquely identifiable. If one desired interpretable
addresses, one might add to the address a value indicating which
conditional branch was taken, in a manner similar to the rules that
immediately follow. This could be useful for debugging or other forms
of graphical model inspection.
The fresh value c can be anything, such as an integer counter for the
number of function calls in the program source code, or a random hash.
Alternatively, if we want addresses to be human-readable, then c can be
a string representation of the expression (e0 e1 ... en) or its lexical
position in the source code.
The translation rules for sample and observe can be thought of as
special cases of the rule for general function application.
$$\frac{e, \alpha \Downarrow_{\mathrm{addr}} e' \qquad \text{choose a fresh value } c}{(\texttt{sample } e),\ \alpha \Downarrow_{\mathrm{addr}} (\texttt{sample } (\texttt{push-addr } \alpha\ c)\ e')}$$

$$\frac{e_1, \alpha \Downarrow_{\mathrm{addr}} e_1' \qquad e_2, \alpha \Downarrow_{\mathrm{addr}} e_2' \qquad \text{choose a fresh value } c}{(\texttt{observe } e_1\ e_2),\ \alpha \Downarrow_{\mathrm{addr}} (\texttt{observe } (\texttt{push-addr } \alpha\ c)\ e_1'\ e_2')}$$
The result of this translation is that each sample and observe expression
in a program will now have a unique address associated with it. These
addresses are constructed dynamically at run time, but are well-defined
in the sense that a sample or observe expression will have an address
that is fully determined by its call stack. This means that this address
scheme is valid for any HOPPL program, including programs that can
instantiate an unbounded number of variables.
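As a concrete sketch (the literals "s1" and "o1" and the threading of
the extra parameter α are illustrative of the scheme rather than the
exact output of the rules above; a full transformation would also
address the calls to normal):

;; before the addressing transformation
(defn model []
  (let [x (sample (normal 0.0 1.0))]
    (observe (normal x 1.0) 2.0)
    x))

;; after: each sample and observe extends the address α it receives
(defn model [α]
  (let [x (sample (push-addr α "s1") (normal 0.0 1.0))]
    (observe (push-addr α "o1") (normal x 1.0) 2.0)
    x))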
6.3 Continuation-Passing-Style Transformation

Now that each function call in the program has been augmented with
an address that tracks the location in the program execution, the next
step is to transform the computation in a manner that allows us to
pause and resume, potentially forking it multiple times if needed. The
continuation-passing-style (CPS) transformation is a standard method
from functional programming that achieves these goals.
A CPS transformation linearizes a computation into a sequence
of stepwise computations. Each step in this computation evaluates an
expression in the program and passes its value to a function, known as
a continuation, which carries out the remainder of the computation. We
can think of the continuation as a “snapshot” of an intermediate state
in the computation, in the sense that it represents both the expressions
that have been evaluated so far, and the expressions that need to be
evaluated to complete the computation.
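To make this concrete, here is a sketch of the idea on an ordinary
deterministic expression; the primed operators and the continuation
name κ are illustrative, and the state σ is omitted for brevity:

;; direct style
(+ (* 2 3) 1)

;; continuation-passing style: (*' 2 3 ...) computes 6 and passes it
;; to the continuation (fn [v] ...), which computes 7 and passes the
;; result to the continuation κ of the whole expression
(*' 2 3
    (fn [v] (+' v 1 κ)))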
In the context of the messaging interface that we define in this
chapter, a CPS transformation is precisely what we need to implement
pausing, resuming, and forking. Once we transform a HOPPL program
into CPS form, we gain access to the continuation at every sample and
observe expression. This continuation can be called once to continue
the computation, or multiple times to fork the computation.
There are many ways of translating a program to continuation
passing style. We will here describe a relatively simple version of the
transformation; for better optimized CPS transformations, see Appel
(2006). We define the relation
$$e, \kappa, \sigma \Downarrow_c e',$$
which transforms an expression $e$, given a continuation $\kappa$ and an
inference state $\sigma$, into a CPS expression $e'$, together with an
analogous relation
$$q, \sigma \downarrow_c q'$$
for entire programs.
These two rules are unique to the CPS transform for probabilistic
programming languages. They replace the observe and sample operators
with CPS equivalents, which take two additional parameters: κ for the
current continuation and σ for the current state. In this translation we
assume that the addressing transformation has already been applied,
adding an address eaddr as an argument to sample and observe.
6.4 Message Interface Implementation
Now that we have inserted addresses into our programs, and transformed
them into CPS, we are ready to perform inference. To do so, we will
implement the messaging interface that we outlined in Section 6.1. In this
interface, an inference controller process starts copies of the probabilistic
program, which are interrupted at every sample and observe expression.
Upon interruption, each program execution sends a message to the
controller, which then carries out any inference operations that need
to be performed. These operations can include sampling a new value,
reusing a stored value, computing log probabilities, and resampling
program executions. After carrying out these operations, the controller
sends messages to the program executions to continue or fork the
computation.
As we noted previously, the messaging interface creates an abstrac-
tion boundary between the controller process and the execution process.
As long as each process can send and receive messages, it need not have
knowledge of the internals of the other process. This means that the
two processes can be implemented in different languages if need be, and
can even be executed on different machines.
In order to clearly highlight this separation between model execution
and inference, we will implement our messaging protocol using a client-
server architecture. The server carries out program executions, and
the client is the inference process, which sends requests to the server
to start, continue, and fork processes. We assume the existence of an
interface that supports centrally-coordinated asynchronous message
passing in the form of request and response. Common networking
packages such as ZeroMQ (Powell, 2015) provide abstractions for these
patterns. We will also assume a mechanism for defining and serializing
messages, e.g. protobuf (Google, 2018). At an operational level, the
most important requirement in this interface is that we are able to
serialize and deserialize distribution objects, which effectively means
that the inference process and the modeling language must implement
the same set of distribution primitives.
(defn observe [α d c κ σ]
  ["observe" α d c κ σ])

(defn return [c σ]
  ["return" c σ])
6.6 Metropolis-Hastings
The primary difference between this algorithm and that of Section 4.2
is due to the dynamic addressing. In the FOPPL, each function is
guaranteed to be called a finite number of times. This means that
we can unroll the entire computation, inlining functions, and literally
annotate every sample and observe that can ever be evaluated with a
unique identifier. In the HOPPL, programs can have an unbounded
number of addresses that can be encountered, which necessitates the
addressing transformation that we defined in Section 6.2.
As in the evaluator-based implementation in Section 4.2, the infer-
ence controller maintains a trace X for the current sample and a trace
X' for the current proposal, which track the values for each sample form
that is evaluated. We also maintain maps log P and log P' that hold the log
probability for each sample and observe form. The acceptance ratio is
6.7 Sequential Monte Carlo
While the previous two algorithms were very similar to those presented
for the FOPPL, running SMC in the HOPPL context is slightly more
complex, though doing so opens up significant opportunities for scaling
and efficiency of inference. We will need to take advantage of the "fork"
message, and due to the (potentially) asynchronous nature in which
the HOPPL code is executed, we will need to be careful to track the
execution IDs of particular running copies of the model program.
An inference controller for SMC is shown in Algorithm 18. As in
the implementation of likelihood weighting, we start L executions in
parallel, and then listen for responses. When an execution reaches
a sample interrupt, we simply sample from the prior and continue
execution. When one of the executions reaches an observe, we will need
to perform a resampling step, but we cannot do so until all executions
have arrived at the same observe. For this reason we store the address
of the current observe in a variable αcur , and use a particle counter l
to track how many of the executions have arrived at the current observe.
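A rough sketch of this controller loop follows; the messaging helpers
send-start, send-continue, send-fork, and receive-message, the density
helper log-prob, the prior sampler draw, and the resampling helper
resample-ids are all assumed names, and handling of "return" messages
is omitted:

(defn smc-controller [L]
  ;; start L copies of the program, then process interrupt messages
  (doseq [l (range L)] (send-start l))
  (loop [αcur nil      ; address of the observe we are waiting at
         arrived []]   ; [id log-weight] pairs held at αcur
    (let [[msg-type id & payload] (receive-message)]
      (case msg-type
        ;; sample interrupt: draw from the prior and resume
        "sample"
        (let [[α d] payload]
          (send-continue id (draw d))
          (recur αcur arrived))
        ;; observe interrupt: hold this execution until all L
        ;; executions arrive at the same observe, then resample
        "observe"
        (let [[α d c] payload
              arrived (conj arrived [id (log-prob d c)])]
          (if (< (count arrived) L)
            (recur α arrived)
            (do (doseq [id' (resample-ids arrived)]
                  (send-fork id'))
                (recur nil []))))))))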
7
Advanced Topics

7.1 Inference Compilation

[Figure: schematic of inference compilation, showing a compilation
phase and an inference phase]
the FOPPL. Le et al. (2017a) uses the same training objective, which
we will describe next, but shows how to construct a neural inference
compilation artifact compatible with HOPPL program inference.
We will follow Le et al. (2017a) and briefly describe the idea
for HOPPLs. Recall Section 4.1.1, in which importance sampling in
its general form was discussed. Immediately after that presentation,
a particular choice of proposal distribution was made, namely the
prior, and this choice was fixed for the remainder, leading to a
discussion of likelihood weighting rather than general importance
sampling throughout. In particular, in Chapters 4 and 6, where
evaluation-based inference was introduced, the weights computed in both
likelihood weighting and SMC were always simply the observe log
likelihoods, owing to the choice of the prior as the proposal.
This choice is easy — propose by simply running the program
forward — and always available, but is not necessarily a good choice. In
particular, when observations are highly informative, proposing from the
prior distribution may be arbitrarily statistically inefficient (Del Moral
and Murray, 2015). To motivate this approach to inference compilation,
we note that this choice is not the only possibility, and if a good
proposal were available then the quality of inference performed could
be substantially improved in terms of number of effective samples per
unit computation.
Consider, for the moment, what would happen if someone provided
you with a “good” proposal for every address α,

$$q_\alpha(x_\alpha \mid X_{n-1}, Y) \approx p(x_\alpha \mid X_{n-1}, Y), \qquad (7.1)$$

noting that this is not the incremental prior and that it in general
depends on all observations Y. Here we assume that the n-th sample is
drawn at α for some n, and write $X_{n-1}$ for the samples drawn beforehand.
The likelihood-weighting evaluators can be transformed into importance
sampling evaluators by changing the sample method implementations to
draw xα according to Equation (7.1) instead of pα (xα |Xn−1 ). The rules
for sample would then need to generate a weight too (as opposed to
generating such side effects only at the evaluation of observe
statements, and not at sample statements). This weight would be
$$W_\alpha^{\ell} = \frac{p_\alpha(x_\alpha \mid X_{n-1})}{q_\alpha(x_\alpha \mid X_{n-1}, Y)} \qquad (7.2)$$
leading to, for importance sampling rather than likelihood weighting, a
total unnormalized weight per trace ` of
$$W^{\ell} = p(Y \mid X) \prod_{\alpha \in \mathrm{dom}(X)} \frac{p_\alpha(x_\alpha \mid X_{n-1})}{q_\alpha(x_\alpha \mid X_{n-1}, Y)}. \qquad (7.3)$$
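In the client-server setup of Chapter 6, this change would be confined
to the controller's handler for sample messages; the following is a
sketch, in which propose (a lookup of the trained proposal qα), draw,
log-prob, and send-continue are assumed helpers rather than names from
the text:

(defn handle-sample [id α d σ]
  ;; draw from the proposal qα(· | X_{n-1}, Y) instead of the prior d,
  ;; accumulating the log of the weight in Equation (7.2)
  (let [q (propose α (:trace σ) (:observes σ))
        x (draw q)
        log-w (- (log-prob d x) (log-prob q x))]
    (send-continue id x)
    (update σ :log-weight + log-w)))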
7.2 Model Learning

It might seem like this tutorial has implicitly advocated for unsupervised
model-based inference. One thing machine learning and deep learning
have reminded us over and over again is that writing good, detailed,
descriptive, and useful models is itself very hard. Time and again,
data-driven discriminative techniques have been shown to be generally
superior at solving specific posed tasks, with the significant caveat that
large amounts of labeled training data are typically required. Taking
inspiration from this encourages us to think about how we can use
data-driven approaches to learn generative models, either in part or in
whole, rather than writing them entirely by hand. In this section we
look at how probabilistic programming techniques can generalize such
that “bottom-up” deep learning and “top-down” probabilistic modeling
can work together in concert.
Top-down model specification is equivalent to the act of writing a
probabilistic program: top-down means, roughly, starting with latent
variables and performing a forward computation to arrive at an ob-
servation. Our journey from FOPPL to HOPPL merely increased our
flexibility in writing such forward computations.

$$\log p_\theta(Y) \geq \mathbb{E}_{q_\phi(X \mid Y)}\!\left[ \log p_\theta(X, Y) - \log q_\phi(X \mid Y) \right] = \mathrm{ELBO}(\theta, \phi, Y). \qquad (7.9)$$
Provided we assume that the generative model is differentiable with
respect to its parameters θ, the sampling process for drawing a random
latent variable can often be reparameterized to admit gradient
estimation.

Much of the interest in this combination stems from the recent
realization that general purpose inference methods like
those utilized in probabilistic programming offer an avenue for tighten-
ing the lower bound for model evidence during variational autoencoder
training while remaining compatible with more richly structured models
and inference networks. Arguably the first example of this was the
importance weighted autoencoder (Burda et al., 2016) which, effectively,
uses an importance sampling inference engine during variational autoen-
coder training to both tighten the bound and minimize the variance
of gradient estimates during training. A more advanced version of this
idea that uses sequential Monte Carlo instead of importance sampling
has appeared recently in triplicate simultaneously (Le et al., 2017c;
Naesseth et al., 2017; Maddison et al., 2017). These papers all roughly
identify and exploit the fact that the marginal probability estimate
computed by sequential Monte Carlo tightens the ELBO even further
(by giving it a new form), and moreover, the power of sequential Monte
Carlo allows for the efficient and full intertwining of the decoder and
encoder architectures even when either or both have internal latent
stochastic random variables. The goal and hope is that the variational
techniques combined with powerful inference algorithms will allow for
simultaneous model refinement/learning and inference compilation in
the full family of HOPPL-denotable models and beyond.
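For reference, the importance weighted bound of Burda et al. (2016)
takes the standard form (stated here in its usual notation rather than
quoted from the text):

$$\mathcal{L}_K(\theta, \phi, Y) = \mathbb{E}_{x^{1:K} \sim q_\phi(\cdot \mid Y)}\!\left[\log \frac{1}{K} \sum_{k=1}^{K} \frac{p_\theta(x^k, Y)}{q_\phi(x^k \mid Y)}\right] \leq \log p_\theta(Y),$$

which recovers the standard ELBO at K = 1 and becomes tight as K grows.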
7.4 Nesting
$$\pi_1(y, z, D) = p(y)\,p(z \mid y)\,p(D \mid y, z) = \mathrm{Beta}(y; 2, 3)\,\Gamma(z; y, 1)\,\mathcal{N}(D; y, z^2)$$

or

$$\pi_2(y, z, D) = p(y)\,p(z \mid y, D) = \frac{p(y)\,p(z \mid y)\,p(D \mid y, z)}{\int p(z \mid y)\,p(D \mid y, z)\,dz} = \frac{p(y)\,p(z \mid y)\,p(D \mid y, z)}{p(D \mid y)} \neq \pi_1(y, z, D)$$
The first interpretation is what you would expect if you were to inline
the inner query as one can do for a function body in a pure functional
language. While doing such a thing introduces no mathematical compli-
cations, it is incompatible with the conditional distribution semantics
we have established for HOPPL programs. The second interpretation is
correct in that it is in keeping with such semantics but introduces the
extra marginal probability term p(D|y) which is impossible to compute
in the general case, complicating matters rather a lot.
Rainforth (2018) categorizes nesting into three types: “sampling
from the conditional distribution of another query” (which we refer
to as nested inference), “factoring the trace probability of one query
with the partition function estimate of another” (which we refer to as
nested conditioning), and “using expectation estimates calculated using
one query as first-class variables in another query”.
  (observe (normal x 1) 2)
  x)
there are more exotic, interesting quasi-Borel spaces that do not arise
from this recipe.
Heunen et al.’s axiomatization regards a function f from a quasi-
Borel space (X, M ) to (Y, N ) as good if f ◦ r ∈ N for all r ∈ M ; in
words, f maps a random variable in M to a random variable in N . They
have shown that such good functions on (R, MR ) themselves form a
quasi-Borel space when equipped with a particular set of function-valued
random variables. Furthermore, they have proved that the application
function (fn [f x] (f x)) from above is a good function in their sense
because it maps a pair consisting of such a function-valued random
variable and an R-valued random variable to a random variable in MR.
Heunen et al.’s axiomatization has been used to define the semantics
of probabilistic programming languages with higher-order functions and
also to validate inference algorithms for such languages. For interested
readers, we suggest Heunen et al. (2017) as well as Scibior et al. (2018).
8
Conclusion
inference in known and fixed models, and on reporting state-of-the-art
techniques for such one-shot inference; however, we believe that the
challenges of model learning and of rapid, approximate, repeated
inference are both of paramount importance, particularly for artificial
intelligence applications. Our belief is that probabilistic programming
techniques, and, more broadly, the practice of paying close attention to
how language design choices impact both what the end user can do easily
and what the evaluator can compute easily, should be considered
throughout the evolution of the next toolchain for artificial
intelligence.
References
Duchi, J., E. Hazan, and Y. Singer (2011), ‘Adaptive subgradient methods for
online learning and stochastic optimization’. Journal of Machine Learning
Research 12(Jul), 2121–2159.
Friedman, D. P. and M. Wand (2008), Essentials of Programming Languages.
MIT Press.
Ge, H., K. Xu, and Z. Ghahramani (2018), ‘Turing: A Language for Flexible
Probabilistic Inference’. In: A. Storkey and F. Perez-Cruz (eds.): Proceedings
of the Twenty-First International Conference on Artificial Intelligence and
Statistics, Vol. 84 of Proceedings of Machine Learning Research. Playa
Blanca, Lanzarote, Canary Islands, pp. 1682–1690, PMLR.
Gehr, T., S. Misailovic, and M. Vechev (2016), ‘PSI: Exact symbolic inference
for probabilistic programs’. In: International Conference on Computer Aided
Verification. pp. 62–83.
Gelman, A., J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and
D. B. Rubin (2013), Bayesian Data Analysis. 3rd edition.
Geman, S. and D. Geman (1984), ‘Stochastic Relaxation, Gibbs Distributions,
and the Bayesian Restoration of Images’. IEEE Transactions on Pattern
Analysis and Machine Intelligence 6, 721–741.
Gershman, S. J. and N. D. Goodman (2014), ‘Amortized Inference in Proba-
bilistic Reasoning’. In: Proceedings of the Thirty-Sixth Annual Conference
of the Cognitive Science Society.
Ghahramani, Z. (2015), ‘Probabilistic machine learning and artificial intelli-
gence’. Nature 521(7553), 452–459.
Goodfellow, I., Y. Bengio, and A. Courville (2016), Deep Learning. MIT Press.
http://www.deeplearningbook.org.
Goodman, N., V. Mansinghka, D. M. Roy, K. Bonawitz, and J. B. Tenenbaum
(2008), ‘Church: a language for generative models’. In: Proc. 24th Conf.
Uncertainty in Artificial Intelligence (UAI). pp. 220–229.
Goodman, N. D. and A. Stuhlmüller (2014), ‘The Design and Implementation
of Probabilistic Programming Languages’. http://dippl.org. Accessed:
2017-8-22.
Google (2018), ‘Protocol Buffers’. [Online; accessed 15-Aug-2018].
Gordon, A. D., T. A. Henzinger, A. V. Nori, and S. K. Rajamani (2014),
‘Probabilistic programming’. In: Proceedings of the on Future of Software
Engineering. pp. 167–181.
Thrun, S. (2000), ‘Towards programming tools for robots that integrate prob-
abilistic computation and learning’. Proceedings 2000 ICRA. Millennium
Conference. IEEE International Conference on Robotics and Automation.
Symposia Proceedings (Cat. No.00CH37065) 1(April).
Todeschini, A., F. Caron, M. Fuentes, P. Legrand, and P. Del Moral (2014),
‘Biips: Software for Bayesian Inference with Interacting Particle Systems’.
arXiv preprint arXiv:1412.3779.
Tolpin, D., J. W. van de Meent, H. Yang, and F. Wood (2016), ‘Design and
implementation of probabilistic programming language Anglican’. arXiv
preprint arXiv:1608.05263.
Tran, D., M. D. Hoffman, R. A. Saurous, E. Brevdo, K. Murphy, and D. M. Blei
(2017), ‘Deep probabilistic programming’. arXiv preprint arXiv:1701.03757.
Tran, D., A. Kucukelbir, A. B. Dieng, M. Rudolph, D. Liang, and D. M.
Blei (2016), ‘Edward: A library for probabilistic modeling, inference, and
criticism’. arXiv preprint arXiv:1610.09787.
Tristan, J.-B., D. Huang, J. Tassarotti, A. C. Pocock, S. Green, and G. L. Steele
(2014), ‘Augur: Data-Parallel Probabilistic Modeling’. In: Z. Ghahramani, M.
Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (eds.): Advances
in Neural Information Processing Systems 27. Curran Associates, Inc., pp.
2600–2608.
Uber (2018), ‘Pyro’. [Online; accessed 15-Aug-2018].
van de Meent, J. W., B. Paige, D. Tolpin, and F. Wood (2016), ‘Black-box policy
search with probabilistic programs’. In: Proceedings of the 19th International
conference on Artificial Intelligence and Statistics, Vol. 41 of JMLR: W&CP.
pp. 1195–1204.
Van Der Merwe, R., A. Doucet, N. De Freitas, and E. Wan (2000), ‘The
unscented particle filter’. In: Advances in Neural Information Processing
Systems. pp. 584–590.
Webb, S., A. Golinski, R. Zinkov, N. Siddharth, T. Rainforth, Y. W. Teh,
and F. Wood (2017), ‘Faithful Inversion of Generative Models for Effective
Amortized Inference’. arXiv preprint arXiv:1712.00287.
Whiteley, N., A. Lee, K. Heine, et al. (2016), ‘On the role of interaction in
sequential Monte Carlo algorithms’. Bernoulli 22(1), 494–529.
Wikipedia contributors (2018), ‘Pattern Matching’. [Online; accessed 14-Aug-
2018].