Introduction to probabilistic programming
David Duvenaud and James Lloyd
University of Cambridge
Thanks to
Daniel M Roy (Cambridge)
HOW TO WRITE A BAYESIAN MODELING PAPER
1. Write down a generative model in an afternoon
2. Get 2 grad students to implement inference for a month
3. Use technical details of inference to pad half of the paper
CAN WE DO BETTER?
Example: Graphical Models
Application Papers
1. Write down a graphical model
2. Perform inference using general-purpose software
3. Apply to some new problem
Inference Papers
1. Identify common structures in graphical models (e.g. chains)
2. Develop efficient inference method
3. Implement in a general-purpose software package
Modeling and inference have been disentangled
EXPRESSIVITY
Not all models are graphical models
What is the largest class of models available?
Probabilistic Programs
- A probabilistic program (PP) is any program that can depend on random choices. It can be written in any language that has a random number generator.
- You can specify any computable prior simply by writing down a PP that generates samples.
- A probabilistic program implicitly defines a distribution over its output.
AN EXAMPLE PROBABILISTIC PROGRAM

flip = rand < 0.5
if flip
    x = randg + 2   % Random draw from Gamma(1,1)
else
    x = randn       % Random draw from standard Normal
end

Implied distributions over variables
[Figure: bar chart of p(flip) and density plot of p(x)]
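Because the program defines these distributions only implicitly, the simplest way to approximate them (and to reproduce plots like the one sketched above) is plain forward simulation. A minimal MATLAB sketch written for these notes, not from the slides; randg assumes the Statistics Toolbox.

S = 100000;                 % number of independent runs of the program
flips = false(S, 1);
xs = zeros(S, 1);
for s = 1:S
    flip = rand < 0.5;
    if flip
        x = randg + 2;      % Gamma(1,1) draw, shifted by 2
    else
        x = randn;          % standard Normal draw
    end
    flips(s) = flip;
    xs(s) = x;
end
mean(flips)                 % Monte Carlo estimate of p(flip = true)
histogram(xs, 100)          % Monte Carlo picture of p(x)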
PROBABILISTIC PROGRAMMING: CONDITIONING
Once we've defined a prior, what can we do with it?
The stochastic program defines a joint distribution P(D, N, H):
- D is the subset of variables we observe (condition on)
- H is the set of variables we're interested in
- N is the set of variables we're not interested in (so we'll integrate them out)
We want to know about P(H | D).
Probabilistic Programming
- Usually refers to doing conditional inference when a probabilistic program specifies your prior.
- Could also be described as automated inference given a model specified by a generative procedure.
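Written out, the target of inference is a standard application of Bayes' rule: fix the observed variables D, sum out (or, for continuous variables, integrate out) the nuisance variables N, and normalise:

$$P(H \mid D) \;=\; \frac{\sum_{N} P(D, N, H)}{\sum_{N, H} P(D, N, H)}$$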
AN EXAMPLE PROBABILISTIC PROGRAM: CONDITIONING

flip = rand < 0.5
if flip
    x = randg + 2   % Random draw from Gamma(1,1)
else
    x = randn       % Random draw from standard Normal
end

Implied distributions over variables
[Figure: bar charts of the prior p(flip) and the conditional p(flip | 2 < x < 3)]
CAN WE DEVELOP GENERIC INFERENCE FOR ALL PPS?
Rejection sampling
1. Run the program with a fresh source of random numbers
2. If condition D is true, record H as a sample, else ignore the sample
3. Repeat
Example

flip = rand < 0.5
if flip
    x = randg + 2
else
    x = randn
end

This produces samples over the execution trace,
e.g. (True, 2.7), (True, 2.1), (False, 2.3), . . .
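As a concrete illustration, here is a rejection sampler for this example, conditioning on the observation 2 < x < 3 used earlier. This is a sketch written for these notes rather than code from the slides; randg assumes the MATLAB Statistics Toolbox.

S = 10000;                  % number of accepted samples to collect
samples = zeros(S, 2);      % columns: flip, x
s = 0;
while s < S
    flip = rand < 0.5;
    if flip
        x = randg + 2;      % Gamma(1,1) draw, shifted by 2
    else
        x = randn;          % standard Normal draw
    end
    if x > 2 && x < 3       % condition D: keep the run only if it holds
        s = s + 1;
        samples(s, :) = [flip, x];
    end
end
mean(samples(:, 1))         % estimate of p(flip = true | 2 < x < 3)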
CAN WE BE MORE EFFICIENT?
Metropolis-Hastings
1. Start with a trace
   - (True, 2.3)
2. Change one random decision, discarding subsequent decisions
   - (False, )
3. Sample subsequent decisions
   - (False, -0.9)
4. Accept with the appropriate (RJ)MCMC acceptance probability
   - Reject: does not satisfy the observation (i.e. the likelihood is zero)
CAN WE BE MORE EFFICIENT?
Metropolis-Hastings
1. Start with a trace
   - (True, 2.3)
2. Change one random decision, discarding subsequent decisions
   - (True, 2.9)
3. Sample subsequent decisions
   - Nothing to do
4. Accept with the appropriate (RJ)MCMC acceptance probability
   - Accept (maybe)
PP VIA METROPOLIS-HASTINGS: NOTATION
Evaluating a program results in a sequence of random choices
$$x_1 \sim p_{t_1}(x_1), \quad x_2 \sim p_{t_2}(x_2 \mid x_1), \quad x_3 \sim p_{t_3}(x_3 \mid x_2, x_1), \quad \ldots, \quad x_k \sim p_{t_k}(x_k \mid x_1, \ldots, x_{k-1}),$$
and the whole sequence $(x_1, \ldots, x_K)$ is called the execution trace.
The density/probability of a particular evaluation is then
$$p(x_1, \ldots, x_K) = \prod_{k=1}^{K} p_{t_k}(x_k \mid x_1, \ldots, x_{k-1}).$$
We then perform MH over the execution trace $x = (x_1, \ldots, x_K)$.
MH OVER EXECUTION TRACES
1. Select a random decision in the execution trace $x$
   - e.g. $x_k$
2. Propose a new value
   - e.g. $x_k' \sim K_{t_k}(x_k' \mid x_k)$
3. Run the program to determine all subsequent choices $(x_l' : l > k)$, reusing current choices where possible
4. Propose moving from the state $(x_1, \ldots, x_K)$ to $(\underbrace{x_1, \ldots, x_{k-1}}_{\text{old choices}}, \underbrace{x_k', \ldots, x_{K'}'}_{\text{new choices}})$
5. Accept the change with the appropriate MH acceptance probability
$$\min\!\left(1, \; \frac{K_{t_k}(x_k \mid x_k') \prod_{i=k}^{K'} p_{t_i'}(x_i' \mid x_1, \ldots, x_{k-1}, x_k', \ldots, x_{i-1}')}{K_{t_k}(x_k' \mid x_k) \prod_{i=k}^{K} p_{t_i}(x_i \mid x_1, \ldots, x_{k-1}, x_k, \ldots, x_{i-1})}\right)$$
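To make this concrete, here is a deliberately simplified MATLAB sketch of single-site MH over traces, specialised to the two-choice example program and the observation 2 < x < 3 from the earlier slides (an illustration written for these notes, not code from the deck). Because each proposal resamples one choice from its prior and regenerates the later choices from theirs, the acceptance ratio above collapses to checking that the proposed trace still satisfies the observation.

S = 10000;
flip = true; x = 2.3;             % initial trace satisfying the observation
samples = zeros(S, 2);
for s = 1:S
    k = randi(2);                 % pick a random decision: 1 = flip, 2 = x
    if k == 1
        new_flip = rand < 0.5;    % resample flip from its prior...
    else
        new_flip = flip;          % ...or keep it
    end
    if new_flip                   % regenerate x from the (possibly new) branch
        new_x = randg + 2;
    else
        new_x = randn;
    end
    if new_x > 2 && new_x < 3     % accept iff the observation still holds
        flip = new_flip; x = new_x;
    end
    samples(s, :) = [flip, x];
end
mean(samples(:, 1))               % estimate of p(flip = true | 2 < x < 3)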
DEMO: REGRESSION WITH AN INTERESTING PRIOR

NONPARAMETRIC MODELS
If we can sample from the prior of a nonparametric model using finite resources with probability 1, then we can perform inference automatically using the techniques described thus far.
We can sample from a number of nonparametric processes/models with finite resources (with probability 1) using a variety of techniques:
- Gaussian processes via marginalisation
- Dirichlet processes via stick breaking
- Indian Buffet processes via urn schemes
Active research to produce finite sampling algorithms for other nonparametric processes (e.g. hierarchical beta processes, negative binomial process).
EXAMPLE: MIXTURE OF GAUSSIANS
Generative model
$$\mu_k \overset{\text{iid}}{\sim} \mathcal{N}(0, 1), \quad k = 1, \ldots, K$$
$$\pi \sim \mathrm{Dir}(\alpha/K)$$
$$\theta := \sum_{k=1}^{K} \pi_k \delta_{\mu_k}$$
$$\theta_n \overset{\text{iid}}{\sim} \theta, \qquad x_n \sim \mathcal{N}(\theta_n, 1), \quad n = 1, \ldots, N$$

(Pseudo) MATLAB code

mu = randn(K, 1)
pi = dirichlet(K, alpha/K)
for n = 1:N
    theta = mu(mnrnd(1, pi) == 1)
    x(n) = theta + randn
end
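For reference, here is a runnable version of the pseudo-code above (a sketch written for these notes, not from the slides). The Dirichlet draw is obtained by normalising independent Gamma(alpha/K, 1) variables; gamrnd and mnrnd assume the Statistics Toolbox.

K = 5; N = 100; alpha = 1;               % illustrative settings
mu = randn(K, 1);                        % component means, mu_k ~ N(0,1)
g = gamrnd(alpha/K * ones(1, K), 1);     % independent Gamma(alpha/K, 1) draws...
pi = g / sum(g);                         % ...normalised: pi ~ Dir(alpha/K)
x = zeros(N, 1);
for n = 1:N
    z = find(mnrnd(1, pi) == 1);         % sample a component index from pi
    x(n) = mu(z) + randn;                % Gaussian observation around its mean
end
histogram(x, 30)                         % quick look at the sampled data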
EXAMPLE: INFINITE MIXTURE OF GAUSSIANS
The limit of the generative model is a DP:
$$\mu_k \overset{\text{iid}}{\sim} \mathcal{N}(0, 1), \quad \pi \sim \mathrm{Dir}(\alpha/K), \quad \theta := \sum_{k=1}^{K} \pi_k \delta_{\mu_k} \;\longrightarrow\; \theta \sim \mathrm{DP}(\alpha, \mathcal{N}(0, 1)) \quad \text{as } K \to \infty$$
Avoiding infinity
- $\theta$ is now infinitely complex, and can only be represented approximately with finite resources
- However, we can sample a finite number of draws $(\theta_n)_{n=1,\ldots,N}$ from some unknown $\theta$ in finite time (with probability 1) using a stick-breaking algorithm
EXAMPLE: INFINITE MIXTURE OF GAUSSIANS
MATLAB stick-breaking construction

sticks = [];
atoms = [];
for i = 1:n
    p = rand;
    % Lazily extend the stick-breaking representation until the sticks
    % generated so far cover the uniform draw p
    while p > sum(sticks)
        sticks(end+1) = (1 - sum(sticks)) * betarnd(1, alpha);
        atoms(end+1) = randn;
    end
    % theta(i) is the atom whose stick contains p
    theta(i) = atoms(find(cumsum(sticks) >= p, 1, 'first'));
end
x = theta(:) + randn(n, 1);
DEMOS: NONPARAMETRIC MODELS

ADVANCED AUTOMATIC INFERENCE
Now that we have separated inference and model design, we can use any inference algorithm.
We are free to develop inference algorithms independently of specific models.
Once graphical models were identified as a general class, many model-agnostic inference methods followed:
- Belief Propagation
- Pseudo-likelihood
- Mean-field Variational
- MCMC
What generic inference algorithms can we implement for more expressive generative models?

ADVANCED AUTOMATIC INFERENCE: GIBBS
- BUGS: Bayesian inference Using Gibbs Sampling
  - An early, limited form of automated inference in generative models.
  - Began in 1989 in the MRC Biostatistics Unit, Cambridge, UK.
  - A workhorse of applied statisticians. Also JAGS (open-source).
model{
  for (i in 1 : N) {
    S[i] ~ dcat(pi[])
    mu[i] <- theta[S[i]]
    x[i] ~ dpois(mu[i])
    for (j in 1 : C) {
      SC[i, j] <- equals(j, S[i])}}
  # Precision parameter
  alpha ~ dgamma(0.1, 0.1)
  # Constructive DPP
  p[1] <- r[1]
  for (j in 2 : C) {
    p[j] <- r[j] * (1 - r[j - 1]) * p[j - 1] / r[j - 1]}
  p.sum <- sum(p[])
  for (j in 1 : C) {
    theta[j] ~ dgamma(A, B)
    r[j] ~ dbeta(1, alpha)
    # scaling to ensure sum to 1
    pi[j] <- p[j] / p.sum }
  # hierarchical prior on theta[i] or preset parameters
  A ~ dexp(0.1)
  B ~ dgamma(0.1, 0.1)
  # total clusters
  K <- sum(cl[])
  for (j in 1 : C) {
    sumSC[j] <- sum(SC[, j])
    cl[j] <- step(sumSC[j] - 1)}}
Data:
list(x=c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,
2,3, 3, 3, 3, 4, 4, 5, 5, 5, 6, 6, 6,7,7,7,8,9,9,10, 10,
ADVANCED AUTOMATIC INFERENCE: METROPOLIS-HASTINGS
- Bher, MIT-Church (Goodman, Mansinghka, Roy, Bonawitz and Tenenbaum, 2008)
  - Automatic inference in Scheme
- Stochastic Matlab
  - "Lightweight Implementations of Probabilistic Programming Languages via Transformational Compilation" (Wingate, Stuhlmüller, Goodman, 2011)
ADVANCED AUTOMATIC INFERENCE: HMC
- Automatic differentiation in Church: "Nonstandard Interpretations of Probabilistic Programs for Efficient Inference" (Wingate, Goodman, Stuhlmüller, Siskind, 2012)
- Stan (Gelman et al.), http://mc-stan.org/
// Predict from Gaussian Process Logistic Regression
// Fixed covar function: eta_sq=1, rho_sq=1, sigma_sq=0.1
data {
int<lower=1> N1;
vector[N1] x1;
int<lower=0,upper=1> z1[N1];
int<lower=1> N2;
vector[N2] x2;}
transformed data {
int<lower=1> N;
vector[N1+N2] x;
vector[N1+N2] mu;
cov_matrix[N1+N2] Sigma;
N <- N1 + N2;
for (n in 1:N1) x[n] <- x1[n];
for (n in 1:N2) x[N1 + n] <- x2[n];
for (i in 1:N) mu[i] <- 0;
for (i in 1:N)
for (j in 1:N)
Sigma[i,j] <- exp(-pow(x[i] - x[j],2))
+ if_else(i==j, 0.1, 0.0);}
parameters {
vector[N1] y1;
vector[N2] y2;}
model {
vector[N] y;
for (n in 1:N1) y[n] <- y1[n];
for (n in 1:N2) y[N1 + n] <- y2[n];
ADVANCED AUTOMATIC INFERENCE: EXPECTATION PROPAGATION
- Infer.NET (Minka, Winn, Guiver, Knowles, 2012)
  - EP in graphical models
  - Now works in the functional language F#
ADVANCED AUTOMATIC INFERENCE: VARIATIONAL
- Infer.NET has it too.
- "Automated Variational Inference in Probabilistic Programming" (Wingate, Weber, 2012)
  - Learning phase: forward sample, then stochastically update the variational parameters to minimize the expected KL from the true distribution.
  - The dependence of the variational distribution on control logic remains.
ADVANCED AUTOMATIC INFERENCE: HARDWARE
- Natively Probabilistic Computation (Mansinghka, 2009)
- Lyric Semiconductor? (Error-correcting codes)
- Main idea: if we know we're going to be sampling, some errors in computation can be OK.
  - Samplers can be made robust to computational error.
  - Run at low voltage on (cheap?) FPGAs.
- Compile from a generative model to an FPGA (9x9 Ising model sampler).
AUTOMATED MODELING
- Automated inference is helpful for human modelers.
- It is essential for machine-generated models, for example approximate Solomonoff induction.
- It is essential for a more general version of an automated Bayesian statistician.
THEORETICAL DIRECTIONS
Inference in stochastic programs opens up a new branch of computer science, with new generalizations of computability:
- "Computable de Finetti measures" (Freer, Roy, 2012)
- "Noncomputable conditional distributions" (Ackerman, Freer, Roy, 2011)
- "Computable exchangeable sequences have computable de Finetti measures" (Freer, Roy, 2009)
- "On the computability and complexity of Bayesian reasoning" (Roy, 2012)
Main takeaways:
- No general computable algorithm exists for conditioning
- Representational choices are important
  - e.g. a stick-breaking vs. CRP latent representation changes computability
COMPILER DEVELOPMENT
- 1950s: Ask a programmer to implement an algorithm efficiently: they'll write it on their own in assembly.
- 1970s: Novice programmers use high-level languages and let the compiler work out details; experts still write assembly.
  - No good compilers; problem-dependent optimizations that only a human expert could see.
  - Experts still write custom assembly when speed is critical.
- 2000s: On most problems, even experts can't write faster assembly than optimizing compilers.
  - Compilers can automatically profile (JIT).
  - Compilers can take advantage of parallelization and complicated hardware, and make appropriate choices w.r.t. caching.
  - Compilers embody decades of compiler research.
INFERENCE METHODS IN THE FUTURE
- 2010: Ask a grad student to implement inference efficiently: they'll write it on their own.
- 2015: Novice grad students use automatic inference engines and let the compiler work out details; experts still write their own inference.
  - No good automatic inference engines; problem-dependent optimizations that only a human expert can see.
  - Experts still write custom inference when speed is critical.
- 2020: On most problems, even experts can't write faster inference than mature automatic inference engines.
  - Can use parallelization and sophisticated hardware.
  - Can automatically choose appropriate methods (meta-reasoning?).
  - Inference engines will embody 1 decade (!) of PP research.
HOW TO WRITE A BAYESIAN MODELING PAPER: 2010
1. Write down a generative model in an afternoon
2. Get 2 grad students to implement inference for a month
3. Use technical details of inference to pad half of the paper
HOW TO WRITE A BAYESIAN MODELING PAPER: 2020
1. Write down a generative model in an afternoon
2. Run an automatic inference engine (instead of getting 2 grad students to implement inference for a month)
3. Discuss the strengths and weaknesses of the model in the extra 4 pages (instead of padding half the paper with technical details of inference)
Thanks!